[ https://issues.apache.org/jira/browse/OOZIE-1735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13931432#comment-13931432 ]
Shwetha G S commented on OOZIE-1735:
------------------------------------
Can we categorise failures as retriable and non-retriable, marking the coord as
FAILED for non-retriable errors and KILLED for retriable ones? Re-runs would be
allowed on KILLED coords, but not on FAILED ones. For example, EL errors should
fail the coord, since a re-run will not help, whereas DB/network connectivity
issues should kill the coord and allow a re-run. With this approach, the status
clearly signifies the kind of error, and the user can blindly re-run killed
coords.
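
The mapping proposed above could be sketched roughly as follows. This is only
an illustration: CoordFailurePolicy, FailureKind, statusFor and rerunAllowed
are hypothetical names, not existing Oozie classes or methods, and the three
failure kinds are just the examples mentioned in this comment.

```java
/** Hypothetical sketch of the proposal; none of these names exist in Oozie. */
final class CoordFailurePolicy {

    /** The two terminal statuses discussed in the comment. */
    enum Status { FAILED, KILLED }

    /** Hypothetical failure categories, each tagged as retriable or not. */
    enum FailureKind {
        EL_ERROR(false),        // a re-run would hit the same EL error again
        DB_CONNECTIVITY(true),  // e.g. an SQL timeout, may succeed on re-run
        NETWORK(true);          // transient connectivity problem

        final boolean retriable;

        FailureKind(boolean retriable) {
            this.retriable = retriable;
        }
    }

    private CoordFailurePolicy() {
    }

    /** Retriable failures kill the coord; non-retriable ones fail it. */
    static Status statusFor(FailureKind kind) {
        return kind.retriable ? Status.KILLED : Status.FAILED;
    }

    /** Only KILLED coords may be re-run under this proposal. */
    static boolean rerunAllowed(Status status) {
        return status == Status.KILLED;
    }
}
```

With this shape, a caller only needs to check rerunAllowed(status) before
re-running, which is what makes blind re-runs on killed coords safe.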
> Support re-running of failed coordinator and coordinator action
> ---------------------------------------------------------------
>
> Key: OOZIE-1735
> URL: https://issues.apache.org/jira/browse/OOZIE-1735
> Project: Oozie
> Issue Type: Bug
> Reporter: purshotam shah
> Assignee: purshotam shah
>
> We should support re-running a failed job. Jobs are set to failed if there is
> a runtime error (such as an SQL timeout).
> Currently there is no way to recover other than running SQL.
> Rerun should set the coord status to running, set pending to 1, reset
> doneMaterialization, and set last modified to the current time, so that
> materialization continues.
> We should also provide an option to resume a failed action. The behavior will
> be the same as the killed option.
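
For reference, the state reset described in the quoted issue could look roughly
like the sketch below. CoordJobState and resetForRerun are hypothetical names
used only for illustration, not Oozie's actual bean or command classes, and the
sketch assumes that "reset doneMaterialization" means clearing the flag.

```java
import java.time.Instant;

/** Hypothetical stand-in for a coordinator job record; not an Oozie class. */
final class CoordJobState {

    enum Status { RUNNING, FAILED, KILLED }

    Status status;
    int pending;
    boolean doneMaterialization;
    Instant lastModified;

    CoordJobState(Status status, int pending,
                  boolean doneMaterialization, Instant lastModified) {
        this.status = status;
        this.pending = pending;
        this.doneMaterialization = doneMaterialization;
        this.lastModified = lastModified;
    }

    /** Applies the reset from the issue description so materialization resumes. */
    void resetForRerun(Instant now) {
        if (status != Status.FAILED) {
            throw new IllegalStateException("only FAILED jobs are handled here");
        }
        status = Status.RUNNING;      // set coord status to running
        pending = 1;                  // set pending to 1
        doneMaterialization = false;  // assumption: "reset" clears the flag
        lastModified = now;           // last modified becomes the current time
    }
}
```

A caller would invoke job.resetForRerun(Instant.now()) on a failed job and then
persist the updated record.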
--
This message was sent by Atlassian JIRA
(v6.2#6252)