Robert Kanter created OOZIE-1978:
------------------------------------

             Summary: Forkjoin validation code is ridiculously slow in some cases
                 Key: OOZIE-1978
                 URL: https://issues.apache.org/jira/browse/OOZIE-1978
             Project: Oozie
          Issue Type: Bug
          Components: core
    Affects Versions: 4.0.1, trunk
            Reporter: Robert Kanter
             Fix For: trunk
         Attachments: workflow.xml

We've had a few users who have run into problems where submitting a workflow 
appears to hang (for a subworkflow the symptom is similar, but the job stays 
stuck in PREP).  It turns out that if you wait long enough, the submission 
does go through and the workflow runs normally.  The problem is that the 
forkjoin validation code is taking a really long time.

The attached example has a series of 20 forks where each fork has 6 actions 
(it's based on an actual workflow, but all of the names were changed and the 
actions were all replaced by simple shell actions).  One of our support guys 
said it took 1-2 hours, but on my computer it was taking {color:red}*15+ 
hours*{color} (I had to cancel it).
While this example doesn't have any nested forks, those can take a long time 
too.

It's easy to verify that it's the forkjoin validation code that's taking so 
long by looking at a jstack of the Oozie server and seeing deep recursive calls 
to {{org.apache.oozie.workflow.lite.LiteWorkflowAppParser.validateForkJoin}}.  
I also noticed a lot of time being spent in calls to {{LinkedList.contains}}.  
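
To illustrate why time spent in {{LinkedList.contains}} matters, here is a minimal sketch. It is not Oozie code; the class and method names ({{ContainsCostSketch}}, {{countHits}}) are made up for this example. Each {{contains}} call on a {{LinkedList}} scans the list, so repeated membership checks cost O(n) apiece, while a {{HashSet}} answers the same question in expected constant time.

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.LinkedList;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; none of these names come from Oozie.
class ContainsCostSketch {

    // Counts how many probe strings are present in the given collection.
    // The cost of each seen.contains(...) depends on the collection type.
    static int countHits(Collection<String> seen, List<String> probes) {
        int hits = 0;
        for (String p : probes) {
            if (seen.contains(p)) {
                hits++;
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Assumed size, chosen only to make the difference visible.
        List<String> names = new LinkedList<>();
        for (int i = 0; i < 5000; i++) {
            names.add("node-" + i);
        }
        Set<String> nameSet = new HashSet<>(names);

        long t0 = System.nanoTime();
        int listHits = countHits(names, names);   // linear scan per lookup
        long t1 = System.nanoTime();
        int setHits = countHits(nameSet, names);  // hashed lookup
        long t2 = System.nanoTime();

        System.out.printf("LinkedList: %d hits in %d ms%n", listHits, (t1 - t0) / 1_000_000);
        System.out.printf("HashSet:    %d hits in %d ms%n", setHits, (t2 - t1) / 1_000_000);
    }
}
```

Whether swapping the collection type is safe in the real validator depends on whether it relies on insertion order or duplicates, which this sketch does not address.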

I think we have 3 options:
# See if we can make the existing code faster somehow.  Perhaps there's a way 
to parallelize it?  Maybe there's some redundant checking that we can identify 
and skip?  Change some data structures?
# See if we can write a new way to do this validation.  I had originally 
completely rewritten this code a while ago, and we've since made a few fixes to 
catch edge cases and things.  Perhaps it needs another rewrite?
# Try to identify when it's taking a long time and at least let the user know 
what's happening or something.  Right now, it just appears that the Oozie CLI 
has hung and the job doesn't show up in the Oozie server.  Most users aren't 
going to wait more than a minute or two.
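
As a rough illustration of the "redundant checking" idea in option 1, here is a sketch on a graph shaped like the attached example: forks in series, each fanning out to several actions that meet at a join. It is not the logic of {{validateForkJoin}}; the class {{ForkChainSketch}} and its methods are hypothetical, and it assumes a validator that recursively re-walks everything downstream of a node once per path that reaches it. Under that assumption, 20 forks of 6 actions give 6^20 (about 3.7 x 10^15) distinct start-to-end paths, whereas remembering already-visited nodes touches each node once.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only; this is not how Oozie represents workflows.
class ForkChainSketch {

    // Builds fork0 -> {action0_0 .. action0_(width-1)} -> join0 -> fork1 -> ... -> end
    static Map<String, List<String>> build(int forks, int width) {
        Map<String, List<String>> graph = new HashMap<>();
        for (int i = 0; i < forks; i++) {
            List<String> actions = new ArrayList<>();
            for (int j = 0; j < width; j++) {
                String action = "action" + i + "_" + j;
                actions.add(action);
                graph.put(action, Collections.singletonList("join" + i));
            }
            graph.put("fork" + i, actions);
            String next = (i + 1 < forks) ? "fork" + (i + 1) : "end";
            graph.put("join" + i, Collections.singletonList(next));
        }
        graph.put("end", Collections.<String>emptyList());
        return graph;
    }

    // Naive walk: a node is processed once for every path that reaches it.
    static long naiveVisits(Map<String, List<String>> graph, String node) {
        long visits = 1;
        for (String next : graph.get(node)) {
            visits += naiveVisits(graph, next);
        }
        return visits;
    }

    // Walk with a visited set: a node is processed only the first time it is seen.
    static long memoVisits(Map<String, List<String>> graph, String node, Set<String> done) {
        if (!done.add(node)) {
            return 0; // already handled via another path
        }
        long visits = 1;
        for (String next : graph.get(node)) {
            visits += memoVisits(graph, next, done);
        }
        return visits;
    }

    public static void main(String[] args) {
        // 8 forks rather than 20 so the naive walk finishes quickly.
        Map<String, List<String>> small = build(8, 6);
        System.out.println("naive visits, 8 forks x 6: " + naiveVisits(small, "fork0"));
        System.out.println("memo visits,  8 forks x 6: "
                + memoVisits(small, "fork0", new HashSet<String>()));

        // The full 20 x 6 shape is only practical with the visited set.
        Map<String, List<String>> full = build(20, 6);
        System.out.println("memo visits, 20 forks x 6: "
                + memoVisits(full, "fork0", new HashSet<String>()));
    }
}
```

A real validator also has to track which fork each join belongs to, so it may not be able to skip a revisited node outright. The sketch only shows where the exponential blow-up would come from and how much work caching per-node results could save.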




--
This message was sent by Atlassian JIRA
(v6.2#6252)
