[ https://issues.apache.org/jira/browse/OOZIE-2854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16008511#comment-16008511 ]

Robert Kanter commented on OOZIE-2854:
--------------------------------------

Looks good overall, the guava-retrying stuff looks very clean.  Here are a few 
related things I think we should change:
# The patch loads the config properties every time we create a {{Retryer}}.  I 
think we should move that into {{JPAService}} so we can load them once, and 
then pass them to the {{Retryer}}.  
# The {{Retryer}} is general enough that I think it should go in 
{{org.apache.oozie.util}} so that other code can use it, which is another 
reason not to tie the configs directly to the {{Retryer}}.  I'm sure we have 
some other places in Oozie where we have custom retrying code that could be 
replaced by this, but that's out of scope for this JIRA.
# All of the database calls are in {{JPAService}}, except for 
{{QueryExecutor#insert}}.  That's not your doing, but I think we should move it 
to {{JPAService}} to be more consistent and so we can use the configs that 
{{JPAService}} will load in point 1.
# Let's rename the config properties to 
{{oozie.service.JPAService.retry.initial-wait-time.ms}} and 
{{oozie.service.JPAService.retry.max-retries}} now that they'd be moved into 
{{JPAService}}, plus that's more consistent with the other DB related configs.

Let me know if that doesn't make sense.
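To illustrate points 1 and 4, here is a minimal, hypothetical sketch of loading the renamed retry properties once when {{JPAService}} initializes and reusing them for every {{Retryer}}, rather than re-reading the configuration on each construction. The class and method names are illustrative, not from the patch; only the property names come from point 4.

```java
import java.util.Properties;

// Hypothetical holder for the retry settings from point 4. JPAService would
// build one of these once at init() time and pass it to each Retryer, so the
// configuration is read only once.
public class JPAServiceRetryConfig {
    static final String INITIAL_WAIT =
            "oozie.service.JPAService.retry.initial-wait-time.ms";
    static final String MAX_RETRIES =
            "oozie.service.JPAService.retry.max-retries";

    private final long initialWaitMs;
    private final int maxRetries;

    // Defaults (100 ms, 10 attempts) follow the values discussed in the issue.
    public JPAServiceRetryConfig(Properties conf) {
        this.initialWaitMs = Long.parseLong(conf.getProperty(INITIAL_WAIT, "100"));
        this.maxRetries = Integer.parseInt(conf.getProperty(MAX_RETRIES, "10"));
    }

    public long getInitialWaitMs() { return initialWaitMs; }

    public int getMaxRetries() { return maxRetries; }
}
```

Keeping the parsed values in one object also makes it easy for a general-purpose retryer in {{org.apache.oozie.util}} (point 2) to stay config-agnostic: callers hand it plain numbers instead of a service-specific configuration.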

> Oozie should handle transient DB problems
> -----------------------------------------
>
>                 Key: OOZIE-2854
>                 URL: https://issues.apache.org/jira/browse/OOZIE-2854
>             Project: Oozie
>          Issue Type: Improvement
>          Components: core
>            Reporter: Peter Bacsko
>            Assignee: Peter Bacsko
>         Attachments: OOZIE-2854-001.patch, OOZIE-2854-002.patch, 
> OOZIE-2854-003.patch, OOZIE-2854-004.patch, OOZIE-2854-005.patch, 
> OOZIE-2854-POC-001.patch
>
>
> There can be problems when Oozie cannot update the database properly. 
> Recently, we have experienced erratic behavior with two setups:
> * MySQL with the Galera cluster manager. Galera uses cluster-wide optimistic 
> locking, which might cause a transaction to roll back if there are two or more 
> parallel transactions running and one of them cannot complete because of a 
> conflict.
> * MySQL with Percona XtraDB Cluster. If one of the MySQL instances is killed, 
> Oozie might get "Communications link failure" exception during the failover.
> The problem is that failed DB transactions might later cause workflows 
> (which are started/restarted by RecoveryService) to get stuck. It's not 
> clear to us exactly how this happens, but it has to do with the fact that 
> certain DB updates are not executed.
> The solution is to use some sort of retry logic with exponential backoff if 
> the DB update fails. We could start with a 100ms wait time, which is doubled 
> at every retry. The operation can be considered a failure if it still fails 
> after 10 attempts. These values could be configurable. We should discuss 
> initial values in the scope of this JIRA.
> Note that this solution is to handle *transient* failures. If the DB is down 
> for a longer period of time, we have to accept that the internal state of 
> Oozie is corrupted.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
