[
https://issues.apache.org/jira/browse/OOZIE-2854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15996181#comment-15996181
]
Robert Kanter commented on OOZIE-2854:
--------------------------------------
Thanks for the PoC. I haven't tried it out, only looked at the code, but it
seems good overall. I think a max number of retries is more reliable than a
timeout, as we don't know how long each query will take (and I imagine
different queries with different amounts of data will take varying amounts of
time too). A few other comments:
- The new log message when getting the single results, {{LOG.info("No results
found");}}, doesn't seem helpful because it has no context. I think we can
just remove it.
- I think we should use a for loop in {{DBOperationRetryHandler}} instead of
setting {{retries}} to 0 and doing a {{retries++}} in the while condition.
{{for (int retries = 0; retries < maxRetryCount; retries++)}}
- Also in {{DBOperationRetryHandler}}, I would change the log messages. People
will see the error message and panic and might not notice the info message
right after. I think we should combine the two messages into a warn message:
{code:java}
LOG.warn("Operation failed, Sleeping {0} msecs before retry #{1}", waitTime,
retries, e);
{code}
And we should have it log an error message only when the last retry fails,
which should also have a different message in any case because we're not going
to do any more retries.
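To make the last two suggestions concrete, here is a minimal sketch of a for-loop retry that warns before each retry and logs an error only when the final attempt fails. It is not the {{DBOperationRetryHandler}} from the patch: the class {{RetrySketch}}, the method {{executeWithRetry}} and its parameters are hypothetical names, and it uses {{java.util.logging}} with a fixed wait time only so that it stays self-contained.

```java
import java.util.concurrent.Callable;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical sketch; names and signature are assumptions, not the code in the PoC patch.
class RetrySketch {
    private static final Logger LOG = Logger.getLogger(RetrySketch.class.getName());

    // Runs the operation up to maxRetryCount times, sleeping waitTimeMs between attempts.
    static <T> T executeWithRetry(Callable<T> operation, int maxRetryCount, long waitTimeMs)
            throws Exception {
        if (maxRetryCount < 1) {
            throw new IllegalArgumentException("maxRetryCount must be at least 1");
        }
        for (int retries = 1; retries <= maxRetryCount; retries++) {
            try {
                return operation.call();
            } catch (Exception e) {
                if (retries == maxRetryCount) {
                    // Last attempt: error level and a distinct message, since no retry follows.
                    LOG.log(Level.SEVERE,
                            "Operation failed after " + maxRetryCount + " attempts, giving up", e);
                    throw e;
                }
                // Intermediate failure: a single warn message that already announces the retry.
                LOG.log(Level.WARNING,
                        "Operation failed, sleeping " + waitTimeMs + " msecs before retry #" + retries, e);
                Thread.sleep(waitTimeMs);
            }
        }
        // Not reachable: the loop either returns or rethrows on the last attempt.
        throw new IllegalStateException("unreachable");
    }
}
```

A caller would wrap the DB operation in a {{Callable}}, for example {{RetrySketch.executeWithRetry(() -> runQuery(), 10, 100L)}}, where {{runQuery}} is a placeholder for the real operation.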
> Oozie should handle transient DB problems
> -----------------------------------------
>
> Key: OOZIE-2854
> URL: https://issues.apache.org/jira/browse/OOZIE-2854
> Project: Oozie
> Issue Type: Improvement
> Components: core
> Reporter: Peter Bacsko
> Assignee: Peter Bacsko
> Attachments: OOZIE-2854-POC-001.patch
>
>
> There can be problems when Oozie cannot update the database properly.
> Recently, we have experienced erratic behavior with two setups:
> * MySQL with the Galera cluster manager. Galera uses cluster-wide optimistic
> locking which might cause a transaction to roll back if there are two or more
> parallel transactions running and one of them cannot complete because of a
> conflict.
> * MySQL with Percona XtraDB Cluster. If one of the MySQL instances is killed,
> Oozie might get a "Communications link failure" exception during the failover.
> The problem is that failed DB transactions might later cause workflows
> (which are started/re-started by RecoveryService) to get stuck. It's not
> clear to us how this happens but it has to do with the fact that certain DB
> updates are not executed.
> The solution is to use some sort of retry logic with exponential backoff if
> the DB update fails. We could start with a 100ms wait time which is doubled
> at every retry. The operation can be considered a failure if it still fails
> after 10 attempts. These values could be configurable. We should discuss
> initial values in the scope of this JIRA.
> Note that this solution is to handle *transient* failures. If the DB is down
> for a longer period of time, we have to accept that the internal state of
> Oozie is corrupted.
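For the values proposed in the description above (100ms initial wait, doubled at every retry, 10 attempts), the resulting schedule can be worked out with a short sketch. The class {{BackoffSchedule}} and its methods are hypothetical, and the sketch assumes a wait happens only between attempts, which the description does not spell out.

```java
// Hypothetical helper for illustration; not part of Oozie or the attached patch.
class BackoffSchedule {

    // Wait times between consecutive attempts, assuming no wait after the final attempt.
    static long[] waitTimesMs(long initialWaitMs, int maxAttempts) {
        long[] waits = new long[Math.max(0, maxAttempts - 1)];
        long wait = initialWaitMs;
        for (int i = 0; i < waits.length; i++) {
            waits[i] = wait;
            wait *= 2; // the wait time is doubled at every retry
        }
        return waits;
    }

    // Total time spent sleeping if every attempt fails.
    static long totalWaitMs(long[] waits) {
        long total = 0;
        for (long w : waits) {
            total += w;
        }
        return total;
    }
}
```

Under that assumption, {{waitTimesMs(100, 10)}} yields nine waits from 100ms up to 25600ms, and {{totalWaitMs}} sums them to 51100ms, so an operation would be declared failed after roughly 51 seconds of waiting plus the time taken by the attempts themselves.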
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)