[ https://issues.apache.org/jira/browse/OOZIE-1978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15370872#comment-15370872 ]
Peter Bacsko edited comment on OOZIE-1978 at 7/11/16 2:30 PM: -------------------------------------------------------------- HI I was looking at this problem and I think I found out why it is so slow sometimes. Here is a small graph: {code} A1--->A2--->A3 A7--->A8--->A9 | | | | S--> F1 J1-->F2 J2--->E | | | | A4--->A5--->A6 A10-->A11-->A12 {code} The problem is that ALL possible execution path is covered by the {{validateForkJoin()}} method. For example, the first recursion choses the path F1->A1->A2->A3->J1 and then F2->A7->A8->A9->J2. We go back to check F2->A10->A11->A12->J2. Then the calls return to F1 where the second path is chosen: F1->A4->A5->A6. However, it again re-walks both paths for F2->J2. If you have a lot of fork-join pairs, that's a massive amount of possible paths. Eg. we have 10 ForkJoin pairs with 5 forks each, that's 500 paths it total. In the attached example, we have 20 pairs with 6 actions, that is 6^20 (3 656 158 440 062 976) paths. No wonder it doesn't finish in 15 hours. Also, at each invocation, we check if we found a cycle - that's the explanation for the lot of {{LinkedList.contains()}}. I'm already working on a solution which separately verifies the acyclic property of the graph and then it validates FJ. It's not yet finished, but I did a test run on the example and it took <1ms to complete. So looks like it's working. I'll attach the patch as soon as it passes all unit tests. was (Author: pbacsko): HI I was looking at this problem and I think I found out why it is so slow sometimes. Here is a small graph: {code} A1--->A2--->A3 A7--->A8--->A9 | | | | S--> F1 J1-->F2 J2--->E | | | | A4--->A5--->A6 A10-->A11-->A12 {code} The problem is that ALL possible execution path is covered by the {{validateForkJoin()}} method. For example, the first recursion choses the path F1->A1->A2->A3->J1 and then F2->A7->A8->A9->J2. We go back to check F2->A10->A11->A12->J2. Then the calls return to F1 where the second path is chosen: F1->A4->A5->A6. However, it again re-walks both paths for F2->J2. If you have a lot of fork-join pairs, that's a massive amount of possible paths. Eg. we have 10 ForkJoin pairs with 5 paths each, that's 500 path it total. In the attached example, we have 20 pairs with 6 actions, that is 6^20 (3 656 158 440 062 976) paths. No wonder it doesn't finish in 15 hours. Also, at each invocation, we check if we found a cycle - that's the explanation for the lot of {{LinkedList.contains()}}. I'm already working on a solution which separately verifies the acyclic property of the graph and then it validates FJ. It's not yet finished, but I did a test run on the example and it took <1ms to complete. So looks like it's working. I'll attach the patch as soon as it passes all unit tests. > Forkjoin validation code is ridiculously slow in some cases > ----------------------------------------------------------- > > Key: OOZIE-1978 > URL: https://issues.apache.org/jira/browse/OOZIE-1978 > Project: Oozie > Issue Type: Bug > Components: core > Affects Versions: trunk, 4.0.1 > Reporter: Robert Kanter > Assignee: Robert Kanter > Fix For: trunk > > Attachments: workflow.xml > > > We've had a few users who have run into problems where submitting a workflow > appears to hang (in the case of a subworkflow, it's similar but stuck in > PREP). It turns out that if you wait long enough, it will actually go > through and the workflow will run normally. The problem is that the forkjoin > validation code is taking a really long time. > The attached example has a series of 20 forks where each fork has 6 actions > (it's based on an actual workflow, but all of the names were changed and the > actions were all replaced by simple shell actions). One of our support guys > said it took 1-2 hours , but on my computer it was taking {color:red}*15+ > hours*{color} (I had to cancel it) > While this example doesn't have any nested forks, those can also take a long > time too. > It's easy to verify that it's the forkjoin validation code that's taking so > long by looking at a jstack of the Oozie server and seeing deep recursive > calls to > {{org.apache.oozie.workflow.lite.LiteWorkflowAppParser.validateForkJoin}}. I > also noticed a lot of sitting around in calls LinkedList.contains. > I think we have 3 options: > # See if we can make the existing code faster somehow. Perhaps there's a way > to parallelize it? Maybe there's some redundant checking that we can > identify and skip? Change some data structures? etc > # See if we can write a new way to do this validation. I had originally > completely rewritten this code a while ago, and we've since made a few fixes > to catch edge cases and things. Perhaps it needs another rewrite? > # Try to identify when it's taking a long time and at least let the user know > what's happening or something. Right now, it just appears that the Oozie CLI > has hung and the job doesn't show up in the Oozie server. Most users aren't > going to wait more than a minute or two. -- This message was sent by Atlassian JIRA (v6.3.4#6332)