Peter Bacsko created OOZIE-2854:
-----------------------------------
Summary: Oozie should handle transient DB problems
Key: OOZIE-2854
URL: https://issues.apache.org/jira/browse/OOZIE-2854
Project: Oozie
Issue Type: Improvement
Components: core
Reporter: Peter Bacsko
Assignee: Peter Bacsko
There can be problems when Oozie cannot update the database properly. Recently,
we have experienced erratic behavior with two setups:
* MySQL was set up with the Galera cluster manager. Galera uses cluster-wide
optimistic locking, which might cause a transaction to roll back if two or more
transactions are running in parallel and one of them cannot complete because of
a conflict.
* Another setup is MySQL with Percona XtraDB Cluster. If one of the MySQL
instances is killed, Oozie might get a "Communications link failure" exception.
The problem is that failed DB transactions might later cause workflows (which
are started/re-started by RecoveryService) to get stuck. It's not clear to us
how this happens, but it has to do with the fact that certain DB updates are not
executed.
The solution is to use some sort of retry logic with exponential backoff if the
DB update fails. We could start with a 100ms wait time which is doubled at
every retry. The operation can be considered a failure if it still fails after
10 attempts. These values could be configurable. We should discuss initial
values in the scope of this JIRA.
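To make the proposal concrete, here is a minimal sketch of such a retry loop.
It is not existing Oozie code: the class DbRetry, the Sleeper hook and the
method callWithRetry are hypothetical names used only for illustration. With
the values suggested above (10 attempts, 100ms initial wait, doubling), there
are 9 waits between the attempts, adding up to 51100ms in the worst case.

```java
import java.util.concurrent.Callable;

// Hypothetical helper, not part of Oozie: retries an operation with
// exponential backoff between attempts.
final class DbRetry {
    private DbRetry() {}

    // Hypothetical hook so that callers (and tests) can replace Thread.sleep.
    interface Sleeper {
        void sleep(long millis) throws InterruptedException;
    }

    // Convenience overload that really sleeps between attempts.
    static <T> T callWithRetry(Callable<T> op, int maxAttempts, long initialWaitMs)
            throws Exception {
        return callWithRetry(op, maxAttempts, initialWaitMs, Thread::sleep);
    }

    static <T> T callWithRetry(Callable<T> op, int maxAttempts, long initialWaitMs,
                               Sleeper sleeper) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        long waitMs = initialWaitMs;
        for (int attempt = 1; ; attempt++) {
            try {
                return op.call();
            } catch (InterruptedException e) {
                // Do not retry if the thread was asked to stop.
                throw e;
            } catch (Exception e) {
                // Assumption: every exception is treated as transient here.
                // A real implementation would probably inspect the exception
                // and retry only on rollbacks or connection failures.
                if (attempt >= maxAttempts) {
                    throw e; // still failing after the last attempt: give up
                }
                sleeper.sleep(waitMs);
                waitMs *= 2; // double the wait time at every retry
            }
        }
    }
}
```

A DB update would then be wrapped as
DbRetry.callWithRetry(() -> doUpdate(), 10, 100L), where doUpdate() stands in
for the actual operation. Both numeric arguments are the candidates for the
configurable values mentioned above.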
Note that this solution is meant to handle *transient* failures. If the DB is
down for a longer period of time, we have to accept that the internal state of
Oozie is corrupted.
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)