OOZIE-131: Support WF action level retry
----------------------------------------

                 Key: OOZIE-548
                 URL: https://issues.apache.org/jira/browse/OOZIE-548
             Project: Oozie
          Issue Type: New Feature
            Reporter: Mohammad Kamrul Islam
            Assignee: Roman Shaposhnik


While Hadoop provides task-level retry and Oozie provides its own retry for 
transient errors, it is desirable to also allow WF action-level retry that is 
configured by the user.

In this proposed task, the following sub-tasks need to be considered:

1. Enable the user to specify the retry count and the retry interval (the time 
between two successive tries).
2. The retry interval is specified in minutes and defaults to 10 minutes. The 
default value should be a system-level configuration setting.
3. The default retry count is 0 (no retry), to remain backward compatible.
4. A new state called "RETRY" will be added to WF actions. An action will be in 
the RETRY state if the job failed and needs to be retried.
5. Three fields need to be added to the WF action table: retry_count, 
max_retry, and retry_interval.
6. A service such as the Recovery service will periodically run the following 
SQL: "select action_id from WF_ACTIONS where status = 'RETRY' and 
(last_modified_time + retry_interval) < current_time and max_retry > 
retry_count" and queue RETRY_COMMAND. The last filter in the SQL might not be 
required.
7. RETRY_COMMAND will update the status from RETRY to PREP and push an 
ActionStartXCommand.
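
The flow in items 4 to 7 can be sketched roughly as below. This is only an 
illustration, not Oozie code: the class ActionRetrySketch, the WfAction type, 
and the methods onFailure, isDueForRetry and retry are all hypothetical names. 
The proposal does not say where retry_count is incremented, so the sketch 
assumes it happens when RETRY_COMMAND runs.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical sketch of the proposed retry flow; none of these names are Oozie APIs.
final class ActionRetrySketch {

    enum Status { PREP, RETRY, FAILED }

    // Holds the three proposed columns plus the fields the recovery query reads.
    static final class WfAction {
        final String id;
        final int maxRetry;              // 0 means no retry (the proposed default)
        final int retryIntervalMinutes;  // the proposed default is 10
        int retryCount = 0;
        Status status = Status.PREP;
        Instant lastModifiedTime = Instant.EPOCH;

        WfAction(String id, int maxRetry, int retryIntervalMinutes) {
            this.id = id;
            this.maxRetry = maxRetry;
            this.retryIntervalMinutes = retryIntervalMinutes;
        }
    }

    // On failure, move to RETRY only while retries remain; otherwise fail for good.
    static void onFailure(WfAction a, Instant now) {
        a.status = (a.retryCount < a.maxRetry) ? Status.RETRY : Status.FAILED;
        a.lastModifiedTime = now;
    }

    // Same predicate as the proposed SQL filter, evaluated in memory.
    static boolean isDueForRetry(WfAction a, Instant now) {
        Instant dueAt = a.lastModifiedTime.plus(Duration.ofMinutes(a.retryIntervalMinutes));
        return a.status == Status.RETRY
                && dueAt.isBefore(now)
                && a.maxRetry > a.retryCount;
    }

    // Stand-in for RETRY_COMMAND: RETRY -> PREP. A real command would then
    // push an ActionStartXCommand; incrementing the count here is an assumption.
    static boolean retry(WfAction a, Instant now) {
        if (!isDueForRetry(a, now)) {
            return false;
        }
        a.retryCount++;
        a.status = Status.PREP;
        a.lastModifiedTime = now;
        return true;
    }
}
```

Because onFailure in this sketch only enters RETRY while retry_count is below 
max_retry, the "max_retry > retry_count" check in isDueForRetry is redundant, 
which matches the remark that the last SQL filter might not be required.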

Open Questions:
a) Who will remove the temporary directories/files (such as ACTION_DIR) created 
by Oozie? Should this happen when the job moves to the RETRY state, or could 
RETRY_COMMAND do it?
b) Do we need to keep historical information, such as why the previous retries 
failed? Historical information includes the error code, error message, etc.
c) Anything else?

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
