[ https://issues.apache.org/jira/browse/OOZIE-1205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13577273#comment-13577273 ]
Robert Kanter commented on OOZIE-1205:
--------------------------------------
You're right that the FailXCommand is similar to failJob(); I first tried using
failJob() by itself. When I did, the second action got marked as KILLED. That
is better than getting stuck in RUNNING, but ideally the second action should
be marked as FAILED because its Hadoop job also failed. It's a minor
difference and probably doesn't really matter, but FAILED is technically more
correct.
{quote}If a new command is added, then recovery for it should also be
handled{quote}
Do you mean we should do something (what?) if the command gets an exception?
> If the JobTracker is restarted during a Fork, Oozie doesn't fail all of the
> currently running actions
> -----------------------------------------------------------------------------------------------------
>
> Key: OOZIE-1205
> URL: https://issues.apache.org/jira/browse/OOZIE-1205
> Project: Oozie
> Issue Type: Bug
> Components: action
> Affects Versions: trunk
> Reporter: Robert Kanter
> Assignee: Robert Kanter
> Fix For: trunk, 3.3.2
>
> Attachments: OOZIE-1205.patch
>
>
> If you have a workflow with a fork and restart the JobTracker while it's
> executing the paths in the fork, those two jobs will be lost (as expected).
> Once the timeout occurs on the {{ActionCheckXCommand}}, it will check both
> actions sequentially. While checking the first action, it sets that action's
> status to FAILED and also sets the workflow's status to FAILED. It then moves
> on to the other action that was running concurrently, but it cannot pass the
> precondition check because the workflow is already FAILED (the check
> requires that the workflow is RUNNING). It will retry this every time the
> timeout hits (10 minutes by default) and print a WARN message in the log.
> That action will also stay in the RUNNING state forever, even though the
> underlying job isn't running and the workflow is FAILED.
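
The sequence in the description can be modeled with a small stand-alone Java sketch. This is not Oozie code: the class ForkCheckSketch, its check() and simulate() methods, the two-value Status enum, and the action names are all hypothetical, chosen only to illustrate how a precondition that requires a RUNNING workflow leaves the second forked action stuck once the first check has failed the workflow.

```java
import java.util.ArrayList;
import java.util.List;

public class ForkCheckSketch {
    // Simplified status model; the real system has more states.
    enum Status { RUNNING, FAILED }

    static class Workflow {
        Status status = Status.RUNNING;
        final List<Action> actions = new ArrayList<>();
    }

    static class Action {
        final String name;
        Status status = Status.RUNNING;
        // Assumption: the underlying job was lost when the JobTracker restarted.
        final boolean jobLost = true;

        Action(String name) {
            this.name = name;
        }
    }

    // Hypothetical stand-in for one action check. The precondition requires
    // the workflow to be RUNNING, as the description says.
    static boolean check(Workflow wf, Action action) {
        if (wf.status != Status.RUNNING) {
            System.out.println("WARN precondition failed for " + action.name);
            return false;
        }
        if (action.jobLost) {
            // Failing the action also fails the whole workflow.
            action.status = Status.FAILED;
            wf.status = Status.FAILED;
        }
        return true;
    }

    // Runs the given number of timeout rounds over a two-path fork,
    // checking the still-RUNNING actions sequentially in each round.
    static Workflow simulate(int timeouts) {
        Workflow wf = new Workflow();
        wf.actions.add(new Action("fork-path-1"));
        wf.actions.add(new Action("fork-path-2"));
        for (int i = 0; i < timeouts; i++) {
            for (Action a : wf.actions) {
                if (a.status == Status.RUNNING) {
                    check(wf, a);
                }
            }
        }
        return wf;
    }

    public static void main(String[] args) {
        Workflow wf = simulate(3);
        for (Action a : wf.actions) {
            System.out.println(a.name + " -> " + a.status);
        }
        System.out.println("workflow -> " + wf.status);
    }
}
```

In this model the first action and the workflow end up FAILED after the first round, while the second action fails the precondition in every round and never leaves RUNNING, which matches the behavior reported above.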