Robert Kanter created OOZIE-1879:
------------------------------------
Summary: Workflow Rerun causes error depending on the order of
forked nodes
Key: OOZIE-1879
URL: https://issues.apache.org/jira/browse/OOZIE-1879
Project: Oozie
Issue Type: Bug
Components: core
Affects Versions: trunk
Reporter: Robert Kanter
Assignee: Robert Kanter
Priority: Blocker
Suppose you have a workflow like this:
{noformat}
start --> fork
fork --> shell1, shell2
shell1 --> join
shell2 --> join
join --> shell3
shell3 --> end
{noformat}
Suppose every action except shell3 succeeds. If you fix the problem with
shell3 and then do a rerun, one of two outcomes occurs:
# If shell1 finished before shell2, then the rerun succeeds
# If shell2 finished before shell1, then the rerun fails
In the second outcome, the failure shows up as this log message:
{noformat}
2014-05-29 17:17:03,735 ERROR
org.apache.oozie.workflow.lite.LiteWorkflowInstance:
SERVER[cdh5-1.cloudera.local] USER[pdvorak] GROUP[-] TOKEN[] APP[test-rerun-wf]
JOB[0000004-140521220856264-oozie-oozi-W]
ACTION[0000004-140521220856264-oozie-oozi-W@join] invalid execution path
[/shell1/]
{noformat}
After some digging, I found that during a rerun of the above workflow (or
similar workflows), LiteWorkflowInstance#signal is called for each action in
the fork in the order the actions are listed in the fork node's XML. During
the original run, however, LiteWorkflowInstance#signal is called for each
action in the order the actions complete (i.e. by endTime). When the two
orders differ, you get the above error. The general fix is therefore to ensure
that, during a rerun, LiteWorkflowInstance#signal is called for each action in
the fork in the order in which the actions originally completed. That is also
arguably more correct than the current behavior.
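The ordering idea can be sketched in isolation. The snippet below is only an illustration, not Oozie code: CompletedAction, ForkRerunOrder and signalOrder are hypothetical names, and the end times are made-up values. It takes the fork's actions in XML order and returns the order in which a rerun would signal them if it followed the original completion order (endTime).

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for a completed workflow action; not an Oozie class.
final class CompletedAction {
    final String name;
    final long endTimeMillis; // completion time recorded during the original run

    CompletedAction(String name, long endTimeMillis) {
        this.name = name;
        this.endTimeMillis = endTimeMillis;
    }
}

final class ForkRerunOrder {
    // Returns the action names ordered by original completion time, earliest first.
    // The input list (XML order) is copied, not modified.
    static List<String> signalOrder(List<CompletedAction> actionsInXmlOrder) {
        List<CompletedAction> sorted = new ArrayList<>(actionsInXmlOrder);
        sorted.sort(Comparator.comparingLong((CompletedAction a) -> a.endTimeMillis));
        List<String> names = new ArrayList<>();
        for (CompletedAction a : sorted) {
            names.add(a.name);
        }
        return names;
    }

    public static void main(String[] args) {
        // XML order is shell1, shell2, but shell2 finished first (illustrative times).
        List<CompletedAction> xmlOrder = new ArrayList<>();
        xmlOrder.add(new CompletedAction("shell1", 2000L));
        xmlOrder.add(new CompletedAction("shell2", 1000L));
        System.out.println(signalOrder(xmlOrder)); // prints [shell2, shell1]
    }
}
```

In the failing case described above (shell2 finished before shell1), this yields shell2 and then shell1, matching the original run rather than the XML listing. How such an ordering would be wired into the actual rerun path is not shown here.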