[ 
https://issues.apache.org/jira/browse/OOZIE-2854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16065004#comment-16065004
 ] 

Andras Piros commented on OOZIE-2854:
-------------------------------------

Taken over from [~pbacsko]. Continuing on the same path with these extensions.

h5. Failure injection
* subclass {{org.apache.commons.dbcp.BasicDataSource}} and override its 
{{createConnectionFactory()}} method so that 
{{Class.forName(driverClassName)}} has a real effect. (See [*the fixed 
method*|https://github.com/apache/commons-dbcp/blob/DBCP_1_4_x_BRANCH/src/java/org/apache/commons/dbcp/BasicDataSource.java#L1588-L1660])
* introduce a JDBC driver extending {{com.mysql.jdbc.Driver}} that delegates 
its {{connect(String, Properties)}} method to a special wrapper
* let this {{java.sql.Connection}} wrapper do nothing but intercept the 
{{prepareStatement(String, int, int)}} call:
** investigate whether it's a DML statement
** investigate whether it's a statement handling an Oozie table
** if so, inject a {{PersistenceException}} with a relatively low database 
error rate (5%)
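
The interception steps above could be sketched roughly as follows. This is a 
hypothetical illustration, not the code from the attached patches: the class 
name {{FailureInjectionSketch}}, the table list (an assumed subset), the 
reading of "DML" as INSERT / UPDATE / DELETE, and the use of a dynamic proxy 
instead of a driver subclass are all assumptions. A plain 
{{RuntimeException}} stands in for {{PersistenceException}} so the sketch 
needs nothing beyond the JDK.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Proxy;
import java.sql.Connection;
import java.util.Locale;
import java.util.Random;
import java.util.Set;

/** Hypothetical sketch of the failure injection idea; not the patch itself. */
final class FailureInjectionSketch {
    // Assumed subset of Oozie table names, for illustration only.
    static final Set<String> OOZIE_TABLES =
            Set.of("WF_JOBS", "WF_ACTIONS", "COORD_JOBS", "COORD_ACTIONS");

    // 5% of the matching statements fail, as in the comment above.
    static final double FAILURE_RATE = 0.05;

    /** Assumption: "DML" means INSERT, UPDATE or DELETE. */
    static boolean isDml(String sql) {
        String s = sql.trim().toUpperCase(Locale.ROOT);
        return s.startsWith("INSERT") || s.startsWith("UPDATE") || s.startsWith("DELETE");
    }

    /** Naive substring check against the assumed table list. */
    static boolean touchesOozieTable(String sql) {
        String s = sql.toUpperCase(Locale.ROOT);
        return OOZIE_TABLES.stream().anyMatch(s::contains);
    }

    static boolean shouldFail(String sql, Random random) {
        return isDml(sql) && touchesOozieTable(sql) && random.nextDouble() < FAILURE_RATE;
    }

    /** Wraps a connection; only prepareStatement(String, int, int) is intercepted. */
    static Connection wrap(Connection delegate, Random random) {
        InvocationHandler handler = (proxy, method, args) -> {
            boolean intercepted = "prepareStatement".equals(method.getName())
                    && args != null && args.length == 3 && args[0] instanceof String;
            if (intercepted && shouldFail((String) args[0], random)) {
                // Stand-in for the PersistenceException mentioned above.
                throw new RuntimeException("injected transient database failure");
            }
            try {
                return method.invoke(delegate, args);
            } catch (InvocationTargetException e) {
                throw e.getCause(); // rethrow what the real connection threw
            }
        };
        return (Connection) Proxy.newProxyInstance(
                FailureInjectionSketch.class.getClassLoader(),
                new Class<?>[] {Connection.class},
                handler);
    }
}
```

Passing the {{Random}} in makes the failure rate deterministic in tests; every 
other {{Connection}} method goes straight to the delegate.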

h5. Integration testing
* use {{FailingHSQLDBDriverWrapper}} extending {{org.hsqldb.jdbcDriver}} to 
intercept JDBC calls
* use integration test cases extending {{MiniOozieTestCase}} for the following 
scenarios:
** parallel calls to JPA queries succeed despite the injected 
errors
** workflows continue to pass without injected errors
** workflows pass with injected errors

h5. Functional testing
* use {{FailingMySQLDriverWrapper}} extending {{com.mysql.jdbc.Driver}} to 
intercept JDBC calls
* use the following coordinator / workflow scenario:
** fired every minute
** executing for multiple days
** workflow consists of a {{<decision />}} action followed by two paths of 
consecutive {{<fs />}} and {{<shell />}} actions

> Oozie should handle transient database problems
> -----------------------------------------------
>
>                 Key: OOZIE-2854
>                 URL: https://issues.apache.org/jira/browse/OOZIE-2854
>             Project: Oozie
>          Issue Type: Improvement
>          Components: core
>            Reporter: Peter Bacsko
>            Assignee: Andras Piros
>         Attachments: OOZIE-2854-001.patch, OOZIE-2854-002.patch, 
> OOZIE-2854-003.patch, OOZIE-2854-004.patch, OOZIE-2854-005.patch, 
> OOZIE-2854-POC-001.patch
>
>
> There can be problems when Oozie cannot update the database properly. 
> Recently, we have experienced erratic behavior with two setups:
> * MySQL with the Galera cluster manager. Galera uses cluster-wide optimistic 
> locking, which might cause a transaction to roll back if two or more 
> parallel transactions are running and one of them cannot complete because of 
> a conflict.
> * MySQL with Percona XtraDB Cluster. If one of the MySQL instances is killed, 
> Oozie might get a "Communications link failure" exception during the failover.
> The problem is that failed DB transactions might later cause a workflow 
> (which is started/re-started by RecoveryService) to get stuck. It's not 
> clear to us how this happens, but it has to do with the fact that certain DB 
> updates are not executed.
> The solution is to use some sort of retry logic with exponential backoff if 
> the DB update fails. We could start with a 100ms wait time which is doubled 
> at every retry. The operation can be considered a failure if it still fails 
> after 10 attempts. These values could be configurable. We should discuss 
> initial values in the scope of this JIRA.
> Note that this solution is to handle *transient* failures. If the DB is down 
> for a longer period of time, we have to accept that the internal state of 
> Oozie is corrupted.
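
The retry scheme proposed in the description (100 ms initial wait, doubled at 
every retry, giving up after 10 attempts) could look roughly like the sketch 
below. The class and method names are hypothetical, and retrying on any 
{{Exception}} is a simplification: a real implementation would presumably 
retry only on failures it can classify as transient.

```java
import java.util.concurrent.Callable;

/** Hypothetical sketch of retry with exponential backoff; names are assumptions. */
final class RetrySketch {
    /**
     * Calls the operation until it succeeds or maxAttempts is reached.
     * The wait starts at initialWaitMs and doubles after every failed attempt.
     */
    static <T> T callWithRetry(Callable<T> operation, int maxAttempts, long initialWaitMs)
            throws Exception {
        long waitMs = initialWaitMs;
        for (int attempt = 1; ; attempt++) {
            try {
                return operation.call();
            } catch (Exception e) {
                if (attempt >= maxAttempts) {
                    throw e; // still failing after the last attempt: give up
                }
                Thread.sleep(waitMs);
                waitMs *= 2; // exponential backoff
            }
        }
    }
}
```

With the values from the description, {{callWithRetry(op, 10, 100L)}} sleeps 
at most nine times (100 ms, 200 ms, ... 25600 ms), about 51 seconds in total, 
before the last failure is propagated.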



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
