[
https://issues.apache.org/jira/browse/OOZIE-2854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16010200#comment-16010200
]
Peter Bacsko commented on OOZIE-2854:
-------------------------------------
Thanks for the comments [~rkanter].
Unfortunately I don't think we can use {{guava-retrier}}. The problem is that
some of the Executor implementations that are passed to {{JPAService}} throw
{{JPAExecutorException}} on purpose, like {{WorkflowJobGetJPAExecutor}} when a
given job is not found. So if we catch all of them, we can't always tell
whether the exception indicates an underlying SQL problem or comes from the
business logic (that's why the previous tests timed out). We have to revert to
our original implementation, but that alone is not enough.
I think the next steps are:
1. Revert to {{DBOperationRetryHandler}}
2. Catch {{JPAExecutorException}} and examine it:
* if it's an SQL error (E0603) then we retry
* otherwise we don't
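The two steps above could look roughly like this. This is only a sketch: the class, the functional interface, and the stub exception are illustrative stand-ins, not Oozie's actual {{DBOperationRetryHandler}} or {{JPAExecutorException}} API; only the E0603 check mirrors the proposal.

```java
// Sketch of step 2: retry only when the JPAExecutorException carries the
// SQL error code (E0603); rethrow business-logic failures immediately.
// All names here are illustrative, not Oozie's real API.
public final class DbRetryExample {

    @FunctionalInterface
    interface DbOperation<T> {
        T execute() throws JPAExecutorException;
    }

    // Stand-in for Oozie's JPAExecutorException, carrying an error code.
    static class JPAExecutorException extends Exception {
        private final String errorCode;
        JPAExecutorException(String errorCode, String msg) {
            super(msg);
            this.errorCode = errorCode;
        }
        String getErrorCode() { return errorCode; }
    }

    static <T> T retryOnSqlError(DbOperation<T> op, int maxAttempts, long initialWaitMs)
            throws JPAExecutorException, InterruptedException {
        long wait = initialWaitMs;
        for (int attempt = 1; ; attempt++) {
            try {
                return op.execute();
            } catch (JPAExecutorException e) {
                // E0603 signals an underlying SQL problem -> retry;
                // anything else came from the business logic -> fail fast.
                if (!"E0603".equals(e.getErrorCode()) || attempt >= maxAttempts) {
                    throw e;
                }
                Thread.sleep(wait);
                wait *= 2; // exponential backoff
            }
        }
    }

    public static void main(String[] args) throws Exception {
        // Simulate an operation that fails twice with a transient SQL error.
        int[] calls = {0};
        String result = retryOnSqlError(() -> {
            calls[0]++;
            if (calls[0] < 3) {
                throw new JPAExecutorException("E0603", "transient SQL failure");
            }
            return "ok";
        }, 10, 1);
        System.out.println(result + " after " + calls[0] + " attempts"); // prints "ok after 3 attempts"
    }
}
```

A non-E0603 {{JPAExecutorException}} (e.g. "job not found") propagates on the first attempt, so business-logic failures never wait out the backoff.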
> Oozie should handle transient DB problems
> -----------------------------------------
>
> Key: OOZIE-2854
> URL: https://issues.apache.org/jira/browse/OOZIE-2854
> Project: Oozie
> Issue Type: Improvement
> Components: core
> Reporter: Peter Bacsko
> Assignee: Peter Bacsko
> Attachments: OOZIE-2854-001.patch, OOZIE-2854-002.patch,
> OOZIE-2854-003.patch, OOZIE-2854-004.patch, OOZIE-2854-005.patch,
> OOZIE-2854-POC-001.patch
>
>
> There can be problems when Oozie cannot update the database properly.
> Recently, we have experienced erratic behavior with two setups:
> * MySQL with the Galera cluster manager. Galera uses cluster-wide optimistic
> locking, which can cause a transaction to roll back if two or more parallel
> transactions are running and one of them cannot complete because of a
> conflict.
> * MySQL with Percona XtraDB Cluster. If one of the MySQL instances is killed,
> Oozie might get "Communications link failure" exception during the failover.
> The problem is that failed DB transactions can later cause a workflow
> (started or re-started by the RecoveryService) to get stuck. It's not
> entirely clear to us how this happens, but it has to do with the fact that
> certain DB updates are not executed.
> The solution is to use some sort of retry logic with exponential backoff if
> a DB update fails. We could start with a 100 ms wait time which is doubled
> at every retry. The operation can be considered a failure if it still fails
> after 10 attempts. These values could be made configurable. We should
> discuss the initial values in the scope of this JIRA.
> Note that this solution is to handle *transient* failures. If the DB is down
> for a longer period of time, we have to accept that the internal state of
> Oozie is corrupted.
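For reference, the backoff scheme quoted above (100 ms initial wait, doubled at every retry, 10 attempts) adds up to a worst-case total wait of 100 * (2^10 - 1) = 102,300 ms, i.e. about 102 seconds before giving up. A small sketch with illustrative names:

```java
// Wait times for the quoted scheme: start at 100 ms, double each retry,
// give up after 10 attempts. Names here are illustrative, not Oozie code.
public final class BackoffScheduleExample {

    // Returns the per-attempt wait times: initialMs, 2*initialMs, 4*initialMs, ...
    static long[] backoffSchedule(long initialMs, int attempts) {
        long[] waits = new long[attempts];
        long w = initialMs;
        for (int i = 0; i < attempts; i++) {
            waits[i] = w;
            w *= 2;
        }
        return waits;
    }

    public static void main(String[] args) {
        long total = 0;
        for (long w : backoffSchedule(100, 10)) {
            total += w;
        }
        System.out.println("worst-case total wait: " + total + " ms"); // prints "worst-case total wait: 102300 ms"
    }
}
```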
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)