[ 
https://issues.apache.org/jira/browse/OOZIE-548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dénes Bodó updated OOZIE-548:
-----------------------------
    Summary: OOZIE-131: Support WF action level retry  (was: OOZIE-131: Support 
WF action level rery)

> OOZIE-131: Support WF action level retry
> ----------------------------------------
>
>                 Key: OOZIE-548
>                 URL: https://issues.apache.org/jira/browse/OOZIE-548
>             Project: Oozie
>          Issue Type: New Feature
>            Reporter: Mohammad Islam
>            Assignee: Roman Shaposhnik
>            Priority: Major
>
> While Hadoop task-level retry and Oozie-level retry already exist for 
> transient errors, it is desirable to also allow WF action-level retry 
> configured by the user.
> In this proposed task, the following sub-tasks need to be considered:
> 1. Enable user to specify the retry count and retry interval (time between 
> two successive tries).
> 2. Retry interval will be in minutes and the default value is 10 minutes. The 
> default value should be system level configuration.
> 3. Default retry count is 0 (no retry), to remain backward compatible. 
> 4. A new state called "RETRY" will be added to WF actions. An action will be 
> in RETRY state if the job failed and needs to be retried.
> 5. Three fields need to be added to the WF action table: retry_count, 
> max_retry, and retry_interval.
> 6. Some service, such as the Recovery service, will periodically run the 
> following SQL: "select action_id from WF_ACTIONS where status = 'RETRY' and 
> (last_modified_time + retry_interval) < current_time and max_retry > 
> retry_count" and queue RETRY_COMMAND. The last filter of the SQL might not be 
> required.
> 7. RETRY_COMMAND will update the status from RETRY to PREP and push an 
> ActionStartXCommand.
> Open Questions:
> a) Who will remove the temporary directories/files (such as ACTION_DIR) 
> created by Oozie? Should this happen when the job moves to RETRY state, or 
> should RETRY_COMMAND do it?
> b) Do we need to keep historical information, such as why the previous retries 
> failed? Historical information includes error code, error message, etc.
> c) Anything else?
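
Items 4 to 7 of the quoted proposal can be sketched as follows. This is only an illustration under stated assumptions, not Oozie's implementation: it uses an in-memory SQLite table in place of the real WF_ACTIONS schema, stores last_modified_time as epoch seconds, converts retry_interval from minutes to seconds inside the query, and increments retry_count when the action leaves RETRY (the proposal does not say where that increment happens). The names make_db, find_retryable, and run_retry_command are hypothetical, and a plain list stands in for pushing an ActionStartXCommand.

```python
import sqlite3

# Defaults taken from items 2 and 3 of the proposal.
DEFAULT_RETRY_INTERVAL_MIN = 10
DEFAULT_MAX_RETRY = 0

# Simplified stand-in for the WF action table (assumption: only the
# columns named in the proposal, plus status and a timestamp).
SCHEMA = f"""
CREATE TABLE WF_ACTIONS (
    action_id          TEXT PRIMARY KEY,
    status             TEXT NOT NULL,
    last_modified_time INTEGER NOT NULL,  -- epoch seconds (assumption)
    retry_count        INTEGER NOT NULL DEFAULT 0,
    max_retry          INTEGER NOT NULL DEFAULT {DEFAULT_MAX_RETRY},
    retry_interval     INTEGER NOT NULL DEFAULT {DEFAULT_RETRY_INTERVAL_MIN}  -- minutes
)
"""


def make_db():
    """Create an in-memory database holding the simplified table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    return conn


def find_retryable(conn, now):
    """Item 6: select actions in RETRY whose interval has elapsed
    and whose retry budget is not yet exhausted."""
    rows = conn.execute(
        "SELECT action_id FROM WF_ACTIONS"
        " WHERE status = 'RETRY'"
        " AND last_modified_time + retry_interval * 60 < ?"
        " AND max_retry > retry_count"
        " ORDER BY action_id",
        (now,),
    ).fetchall()
    return [row[0] for row in rows]


def run_retry_command(conn, action_id, now, start_queue):
    """Item 7 (hypothetical RETRY_COMMAND): move RETRY -> PREP, then
    queue the action for start. Incrementing retry_count here is an
    assumption, not something the proposal specifies."""
    cur = conn.execute(
        "UPDATE WF_ACTIONS"
        " SET status = 'PREP', retry_count = retry_count + 1,"
        " last_modified_time = ?"
        " WHERE action_id = ? AND status = 'RETRY'",
        (now, action_id),
    )
    if cur.rowcount == 1:
        # Stand-in for pushing an ActionStartXCommand.
        start_queue.append(action_id)
```

A periodic service would call find_retryable with the current time and pass each returned id to run_retry_command. The status guard in the UPDATE keeps the transition from firing twice if the same action is queued more than once.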



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
