[jira] [Comment Edited] (OOZIE-2854) Oozie should handle transient DB problems

Peter Bacsko (JIRA) Thu, 06 Apr 2017 10:59:02 -0700

    [ https://issues.apache.org/jira/browse/OOZIE-2854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15959448#comment-15959448 ]


Peter Bacsko edited comment on OOZIE-2854 at 4/6/17 5:57 PM:
-------------------------------------------------------------

Hi Steven - no, I haven't tried it with Oracle RAC. In fact, I haven't tried it 
with these MySQL setups either. One of our customers experienced this issue, and 
we have a MySQL expert who explained how Galera works and that this is the 
reason why we saw exceptions in the Oozie server logs.

About the optimistic locking in Galera, here is a nice blog entry which 
explains in detail how this might cause problems (plus proposes a workaround): 
https://severalnines.com/blog/avoiding-deadlocks-galera-set-haproxy-single-node-writes-and-multi-node-reads


was (Author: pbacsko):
Hi Stephen - no, I haven't tried it with Oracle RAC. In fact, I haven't tried 
it with these MySQL setups either. One of our customers experienced this issue 
and we have a MySQL expert who pointed out how Galera works and that's the 
reason why we saw exceptions in the Oozie server logs.

About the optimistic locking in Galera, here is a nice blog entry which 
explains in detail how this might cause problems (plus proposes a workaround): 
https://severalnines.com/blog/avoiding-deadlocks-galera-set-haproxy-single-node-writes-and-multi-node-reads

> Oozie should handle transient DB problems
> -----------------------------------------
>
>                 Key: OOZIE-2854
>                 URL: https://issues.apache.org/jira/browse/OOZIE-2854
>             Project: Oozie
>          Issue Type: Improvement
>          Components: core
>            Reporter: Peter Bacsko
>            Assignee: Peter Bacsko
>
> There can be problems when Oozie cannot update the database properly. 
> Recently, we have experienced erratic behavior with two setups:
> * MySQL with the Galera cluster manager. Galera uses cluster-wide optimistic 
> locking, which might cause a transaction to roll back if two or more 
> transactions run in parallel and one of them cannot complete because of a 
> conflict.
> * MySQL with Percona XtraDB Cluster. If one of the MySQL instances is killed, 
> Oozie might get a "Communications link failure" exception during the failover.
> The problem is that failed DB transactions might later cause workflows 
> (which are started/re-started by RecoveryService) to get stuck. It's not 
> clear to us how this happens, but it has to do with the fact that certain DB 
> updates are not executed.
> The solution is to use some sort of retry logic with exponential backoff if 
> the DB update fails. We could start with a 100ms wait time which is doubled 
> at every retry. The operation can be considered a failure if it still fails 
> after 10 attempts. These values could be configurable. We should discuss 
> initial values in the scope of this JIRA.
> Note that this solution is meant to handle *transient* failures only. If the 
> DB is down for a longer period of time, we have to accept that the internal 
> state of Oozie is corrupted.
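
The retry scheme proposed in the description (100ms initial wait, doubled at every retry, give up after 10 attempts) could look roughly like the sketch below. This is only an illustration under stated assumptions, not Oozie code: the class DbRetrySketch and the methods withRetry and totalWaitMs are hypothetical names, and java.sql.SQLException merely stands in for whatever exception the persistence layer actually raises on a failed update. A real implementation would also have to decide which failures count as transient. With the proposed values there are at most nine waits, which add up to 100 * (2^9 - 1) = 51100 ms.

```java
import java.sql.SQLException;
import java.util.concurrent.Callable;

// Hypothetical sketch of retry with exponential backoff; not part of Oozie.
final class DbRetrySketch {

    /**
     * Runs op, retrying on SQLException (an assumed stand-in for a transient
     * DB failure). Sleeps initialWaitMs before the first retry and doubles
     * the wait before each further retry. Rethrows the last failure once
     * maxAttempts attempts have been made.
     */
    static <T> T withRetry(Callable<T> op, int maxAttempts, long initialWaitMs)
            throws Exception {
        long waitMs = initialWaitMs;
        for (int attempt = 1; ; attempt++) {
            try {
                return op.call();
            } catch (SQLException e) {
                if (attempt >= maxAttempts) {
                    throw e; // still failing after the last attempt: give up
                }
                Thread.sleep(waitMs);
                waitMs *= 2; // exponential backoff
            }
        }
    }

    /** Worst-case total sleep time: one wait between each pair of attempts. */
    static long totalWaitMs(int maxAttempts, long initialWaitMs) {
        long total = 0;
        long waitMs = initialWaitMs;
        for (int i = 1; i < maxAttempts; i++) {
            total += waitMs;
            waitMs *= 2;
        }
        return total;
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Simulated DB update that fails twice, then succeeds.
        String result = withRetry(() -> {
            if (++calls[0] < 3) {
                throw new SQLException("simulated transient failure");
            }
            return "committed";
        }, 10, 100L);
        System.out.println(result + " after " + calls[0] + " attempts");
        System.out.println("worst-case wait: " + totalWaitMs(10, 100L) + " ms");
    }
}
```

Both the attempt limit and the initial wait are plain parameters here, which matches the suggestion that these values could be configurable. A worst-case delay of about 51 seconds per operation is something to weigh when choosing the defaults.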



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
