[ https://issues.apache.org/jira/browse/OOZIE-1205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Robert Kanter updated OOZIE-1205:
---------------------------------
Attachment: OOZIE-1205.patch
{quote}
I don't think the second action should be allowed to wait to reach the FAILED
state (even if it is eventually going to reach the FAILED state due to hadoop
job failure).
{quote}
I think that makes sense because otherwise we'd run into "well how long should
we wait for the second action to possibly go into the FAILED state?"
I also think that a simpler solution that doesn't add a new XCommand is better,
especially for an edge case like this.
The new patch simply has ActionCheckXCommand call failJob() instead of
failAction(). I didn't add any unit tests because there weren't any existing
unit tests for this and I'm not really sure of a good/clean way to make one. I
did check that it works correctly though.
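The difference between the two calls can be sketched with a toy model. This is not Oozie source code: the class and the `simulate` helper are hypothetical, `failAction` and `failJob` are simplified stand-ins that only share their names with the real methods, and it is an assumption of the sketch that failing the whole job also ends the sibling action that is still RUNNING.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the behaviour described in this issue; not Oozie source code.
class ForkCheckSketch {
    enum Status { RUNNING, FAILED }

    static class Workflow {
        Status status = Status.RUNNING;
        final List<Status> actions = new ArrayList<>();
    }

    // Hypothetical stand-in for the old behaviour: fail one action and the workflow.
    static void failAction(Workflow wf, int i) {
        wf.actions.set(i, Status.FAILED);
        wf.status = Status.FAILED;
    }

    // Hypothetical stand-in for the patched behaviour (assumption: failing the
    // job also ends every action that is still RUNNING).
    static void failJob(Workflow wf, int i) {
        failAction(wf, i);
        for (int j = 0; j < wf.actions.size(); j++) {
            if (wf.actions.get(j) == Status.RUNNING) {
                wf.actions.set(j, Status.FAILED);
            }
        }
    }

    // One pass of the periodic check over two forked actions whose jobs were lost.
    static String simulate(boolean failWholeJob) {
        Workflow wf = new Workflow();
        wf.actions.add(Status.RUNNING);
        wf.actions.add(Status.RUNNING);
        for (int i = 0; i < wf.actions.size(); i++) {
            // Precondition from the issue: the workflow must still be RUNNING.
            if (wf.status != Status.RUNNING || wf.actions.get(i) != Status.RUNNING) {
                continue; // check is skipped; per the issue, a WARN is logged and it retries later
            }
            if (failWholeJob) {
                failJob(wf, i);
            } else {
                failAction(wf, i);
            }
        }
        return wf.actions.get(0) + "," + wf.actions.get(1);
    }

    public static void main(String[] args) {
        System.out.println("failAction: " + simulate(false));
        System.out.println("failJob:    " + simulate(true));
    }
}
```

In this model the `failAction` path leaves the second action RUNNING because its check never passes the precondition, while the `failJob` path leaves no action behind.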
> If the JobTracker is restarted during a Fork, Oozie doesn't fail all of the
> currently running actions
> -----------------------------------------------------------------------------------------------------
>
> Key: OOZIE-1205
> URL: https://issues.apache.org/jira/browse/OOZIE-1205
> Project: Oozie
> Issue Type: Bug
> Components: action
> Affects Versions: trunk
> Reporter: Robert Kanter
> Assignee: Robert Kanter
> Attachments: OOZIE-1205.patch, OOZIE-1205.patch
>
>
> If you have a workflow with a fork and restart the JobTracker while it's
> executing the paths in the fork, those two jobs will be lost (as expected).
> Once the timeout occurs on the {{ActionCheckXCommand}}, it will check both
> actions sequentially. While checking the first action, it sets the status to
> FAILED and also sets the workflow's status to FAILED. It then moves on to
> the other action that was running concurrently, but it cannot pass the
> precondition check because the workflow was already FAILED (the check
> requires that the Workflow is RUNNING). It will keep trying this every time
> the timeout hits (10 min is the default) and print a WARN message in the log.
> That action will also be in RUNNING state forever even though the underlying
> job isn't running and the WF is FAILED.