[jira] [Created] (OOZIE-1879) Workflow Rerun causes error depending on the order of forked nodes

Robert Kanter (JIRA) Wed, 11 Jun 2014 19:35:18 -0700

Robert Kanter created OOZIE-1879:
------------------------------------

             Summary: Workflow Rerun causes error depending on the order of forked nodes
                 Key: OOZIE-1879
                 URL: https://issues.apache.org/jira/browse/OOZIE-1879
             Project: Oozie
          Issue Type: Bug
          Components: core
    Affects Versions: trunk
            Reporter: Robert Kanter
            Assignee: Robert Kanter
            Priority: Blocker



Suppose you have a workflow like this:
{noformat}
start --> fork
fork --> shell1, shell2
shell1 --> join
shell2 --> join
join --> shell3
shell3 --> end
{noformat}
Suppose every action except shell3 succeeds, and you then fix the problem with 
shell3.  If you do a rerun, one of two outcomes can happen, depending on the 
order in which the forked actions finished in the original run:
# If shell1 finished before shell2, then the rerun succeeds
# If shell2 finished before shell1, then the rerun fails

The error in the second outcome is simply this log message:
{noformat}
2014-05-29 17:17:03,735 ERROR org.apache.oozie.workflow.lite.LiteWorkflowInstance: SERVER[cdh5-1.cloudera.local] USER[pdvorak] GROUP[-] TOKEN[] APP[test-rerun-wf] JOB[0000004-140521220856264-oozie-oozi-W] ACTION[0000004-140521220856264-oozie-oozi-W@join] invalid execution path [/shell1/]
{noformat}

After some digging, I found that during a rerun of the above workflow (or a 
similar one), LiteWorkflowInstance#signal gets called for each action in the 
fork node in the order that the actions are listed in the fork node's XML.  
During the original run, however, LiteWorkflowInstance#signal gets called for 
each action in the order that they completed (i.e. by endTime).  When the two 
orders don't match, you get the above error.  The general fix is therefore to 
ensure that during a rerun, LiteWorkflowInstance#signal gets called for each 
action in the fork node in the order in which the actions originally 
completed.  And if you think about it, that is more correct than the current 
behavior anyway.  
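
To make the proposed ordering concrete, here is a minimal sketch of sorting a fork's actions by their original end time before signaling them. It is not the actual patch: ForkRerunOrder, CompletedAction, and signalOrder are hypothetical names that are not part of Oozie, and the timestamps are invented for illustration. It only shows the idea of replacing the XML order with the completion order.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class ForkRerunOrder {

    /** Hypothetical stand-in for an already-completed action; not an Oozie class. */
    static final class CompletedAction {
        final String name;
        final Instant endTime;

        CompletedAction(String name, Instant endTime) {
            this.name = name;
            this.endTime = endTime;
        }
    }

    /**
     * Returns the action names in the order they originally completed
     * (earliest endTime first), i.e. the order a rerun would signal them in.
     */
    static List<String> signalOrder(List<CompletedAction> forkActions) {
        // Copy first so the caller's list (the XML order) is left untouched.
        List<CompletedAction> sorted = new ArrayList<>(forkActions);
        sorted.sort(Comparator.comparing((CompletedAction a) -> a.endTime));

        List<String> names = new ArrayList<>();
        for (CompletedAction a : sorted) {
            names.add(a.name);
        }
        return names;
    }

    public static void main(String[] args) {
        // The XML lists shell1 before shell2, but shell2 finished first
        // in the original run (end times are made up for this example).
        List<CompletedAction> xmlOrder = Arrays.asList(
                new CompletedAction("shell1", Instant.parse("2014-05-29T17:16:40Z")),
                new CompletedAction("shell2", Instant.parse("2014-05-29T17:16:10Z")));

        System.out.println(signalOrder(xmlOrder)); // prints [shell2, shell1]
    }
}
```

With the failing case from above (shell2 finished before shell1), the sketch yields shell2 then shell1, which matches the order of the original run rather than the order in the XML.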



--
This message was sent by Atlassian JIRA
(v6.2#6252)
