[ https://issues.apache.org/jira/browse/OOZIE-1978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14876612#comment-14876612 ]
Robert Kanter commented on OOZIE-1978:
--------------------------------------
FYI: I've been slowly working on this and have it roughly halfway done.
We've also seen a related issue where a Coordinator gets stuck and can't even
be killed: the Kill command can't acquire the Coordinator's lock because the
lock is held for as long as the forkjoin validation is running.
> Forkjoin validation code is ridiculously slow in some cases
> -----------------------------------------------------------
>
> Key: OOZIE-1978
> URL: https://issues.apache.org/jira/browse/OOZIE-1978
> Project: Oozie
> Issue Type: Bug
> Components: core
> Affects Versions: trunk, 4.0.1
> Reporter: Robert Kanter
> Assignee: Robert Kanter
> Fix For: trunk
>
> Attachments: workflow.xml
>
>
> We've had a few users who have run into problems where submitting a workflow
> appears to hang (in the case of a subworkflow, it's similar but stuck in
> PREP). It turns out that if you wait long enough, it will actually go
> through and the workflow will run normally. The problem is that the forkjoin
> validation code is taking a really long time.
> The attached example has a series of 20 forks where each fork has 6 actions
> (it's based on an actual workflow, but all of the names were changed and the
> actions were all replaced by simple shell actions). One of our support guys
> said it took 1-2 hours, but on my computer it was taking *15+ hours* (I had
> to cancel it).
> While this example doesn't have any nested forks, those can take a long
> time too.
> It's easy to verify that it's the forkjoin validation code that's taking so
> long by looking at a jstack of the Oozie server and seeing deep recursive
> calls to
> {{org.apache.oozie.workflow.lite.LiteWorkflowAppParser.validateForkJoin}}. I
> also noticed a lot of time being spent in calls to {{LinkedList.contains}}.
> I think we have 3 options:
> # See if we can make the existing code faster somehow. Perhaps there's a way
> to parallelize it? Maybe there's some redundant checking that we can
> identify and skip? Change some data structures? etc.
> # See if we can write a new way to do this validation. I had originally
> completely rewritten this code a while ago, and we've since made a few fixes
> to catch edge cases and things. Perhaps it needs another rewrite?
> # Try to identify when it's taking a long time and at least let the user know
> what's happening. Right now, it just appears that the Oozie CLI
> has hung and the job doesn't show up in the Oozie server. Most users aren't
> going to wait more than a minute or two.
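
To illustrate why option 1 might help, here is a rough sketch of how a path-by-path recursive walk blows up on a series of forks, and how remembering already-checked nodes in a HashSet avoids it. This is not the Oozie code: the class, the graph layout, and the node names are all made up for illustration, and the real validateForkJoin also has to pair forks with joins, so a plain visited set may not carry over directly.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class ForkJoinTraversalSketch {

    // Hypothetical graph: node name -> names of the nodes it transitions to.
    // Builds fork-0 -> 6 actions -> join-0 -> fork-1 -> ... -> end.
    static Map<String, List<String>> buildSeries(int forks, int actionsPerFork) {
        Map<String, List<String>> graph = new HashMap<>();
        for (int f = 0; f < forks; f++) {
            List<String> actions = new ArrayList<>();
            for (int a = 0; a < actionsPerFork; a++) {
                String action = "action-" + f + "-" + a;
                actions.add(action);
                graph.put(action, List.of("join-" + f));
            }
            graph.put("fork-" + f, actions);
            String next = (f + 1 < forks) ? "fork-" + (f + 1) : "end";
            graph.put("join-" + f, List.of(next));
        }
        graph.put("end", List.of());
        return graph;
    }

    // Naive walk: follows every path, so each join (and everything after it)
    // is revisited once per incoming action. Work grows roughly as 6^forks.
    static long naiveVisits(Map<String, List<String>> graph, String node) {
        long visits = 1;
        for (String next : graph.get(node)) {
            visits += naiveVisits(graph, next);
        }
        return visits;
    }

    // Walk with a HashSet of already-checked nodes: each node is expanded
    // once, and the membership check is O(1) instead of a linear list scan.
    static long memoizedVisits(Map<String, List<String>> graph, String node, Set<String> seen) {
        if (!seen.add(node)) {
            return 0; // already checked via another path
        }
        long visits = 1;
        for (String next : graph.get(node)) {
            visits += memoizedVisits(graph, next, seen);
        }
        return visits;
    }

    public static void main(String[] args) {
        // Only 6 forks for the naive walk; 20 would not finish in reasonable time.
        Map<String, List<String>> small = buildSeries(6, 6);
        System.out.println("naive visits (6 forks): " + naiveVisits(small, "fork-0"));

        Map<String, List<String>> large = buildSeries(20, 6);
        System.out.println("memoized visits (20 forks): "
                + memoizedVisits(large, "fork-0", new HashSet<>()));
    }
}
```

With 20 forks of 6 actions each, the naive walk would need on the order of 6^20 node visits, which would be consistent with the multi-hour times reported above, while the memoized walk touches each node once.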
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)