OOZIE-131: Support WF action level retry
----------------------------------------
Key: OOZIE-548
URL: https://issues.apache.org/jira/browse/OOZIE-548
Project: Oozie
Issue Type: New Feature
Reporter: Mohammad Kamrul Islam
Assignee: Roman Shaposhnik
While Hadoop task-level retry and Oozie-level retry already exist for transient
errors, it is desirable to also allow user-configured retry at the WF action
level.
In this proposed task, the following sub-tasks need to be considered:
1. Enable user to specify the retry count and retry interval (time between two
successive tries).
2. Retry interval will be in minutes and the default value is 10 minutes. The
default value should be a system-level configuration setting.
3. Default retry count is 0 (no retry), to remain backward compatible.
4. A new state called "RETRY" will be added to WF action. An action will be in
the RETRY state if the job failed and needs to be retried.
5. Three fields need to be added to the WF action table: retry_count,
max_retry, and retry_interval.
6. A service such as the Recovery service will periodically run the
following SQL: "select action_id from WF_ACTIONS where status = 'RETRY' and
(last_modified_time + retry_interval) < current_time and max_retry >
retry_count" and queue a RETRY_COMMAND for each returned action. The last
filter of the SQL might not be required.
7. RETRY_COMMAND will update the status from RETRY to PREP and push an
ActionStartXCommand.
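Items 4 to 7 can be sketched as a small state model. The following Python is
only an illustration under stated assumptions, not Oozie code: the WfAction
class and the on_failure, due_for_retry, and retry_command functions are
hypothetical names, and the FAILED terminal status and the choice to increment
retry_count inside the retry command are assumptions that the proposal does not
specify.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

DEFAULT_RETRY_INTERVAL_MIN = 10  # item 2: system-level default, in minutes
DEFAULT_MAX_RETRY = 0            # item 3: no retry unless the user asks for it


@dataclass
class WfAction:
    """Hypothetical stand-in for a row of the WF action table (item 5)."""
    action_id: str
    status: str
    last_modified_time: datetime
    retry_count: int = 0
    max_retry: int = DEFAULT_MAX_RETRY
    retry_interval: int = DEFAULT_RETRY_INTERVAL_MIN  # minutes


def on_failure(action: WfAction, now: datetime) -> None:
    """Item 4: a failed action enters RETRY only if retries remain."""
    if action.retry_count < action.max_retry:
        action.status = "RETRY"
    else:
        action.status = "FAILED"  # assumed terminal status
    action.last_modified_time = now


def due_for_retry(actions, now: datetime):
    """Item 6: in-memory equivalent of the periodic SQL check."""
    return [
        a.action_id
        for a in actions
        if a.status == "RETRY"
        and a.last_modified_time + timedelta(minutes=a.retry_interval) < now
        and a.max_retry > a.retry_count
    ]


def retry_command(action: WfAction, now: datetime) -> None:
    """Item 7: move the action from RETRY back to PREP for a new start."""
    action.retry_count += 1  # assumption: the attempt is counted here
    action.status = "PREP"
    action.last_modified_time = now
    # A real implementation would queue an ActionStartXCommand at this point.
```

With max_retry = 1 and the default 10 minute interval, a failed action sits in
RETRY, is not selected after 5 minutes, is selected after 11 minutes, and a
second failure after the retry leaves it in the assumed FAILED state.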
Open Questions:
a) Who will remove the temporary directories/files (such as ACTION_DIR) created
by Oozie? Should this happen when the job moves to the RETRY state, or should
RETRY_COMMAND do it?
b) Do we need to keep historical information such as why the previous retries
failed? Historical information includes error code, error message, etc.
c) Anything else?