[
https://issues.apache.org/jira/browse/OOZIE-1849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Robert Kanter updated OOZIE-1849:
---------------------------------
Attachment: OOZIE-1849.patch
The patch implements my second proposed solution: ActionCheckXCommand's
precondition now allows the workflow job to be either RUNNING or SUSPENDED.
I verified that it works correctly in practice and updated a unit test. I
also fixed a problem where that unit test was previously passing for the
wrong reason, and cleaned up some of the other tests so they can't pass
falsely.
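
For anyone skimming, here is a rough standalone sketch of the relaxed
precondition. It is not the code in the patch: the enums, the class, and the
mayCheckAction method are hypothetical names made up for illustration, and
the real check lives inside ActionCheckXCommand.

```java
import java.util.EnumSet;
import java.util.Set;

// Standalone illustration only; none of these types are Oozie classes.
class PreconditionSketch {

    // Hypothetical stand-in for the workflow job status.
    enum JobStatus { PREP, RUNNING, SUSPENDED, SUCCEEDED, KILLED, FAILED }

    // Hypothetical stand-in for the workflow action status.
    enum ActionStatus { PREP, RUNNING, OK, ERROR }

    // Before the change only RUNNING would be in this set (assumption based
    // on the E0818 message); the change adds SUSPENDED.
    static final Set<JobStatus> ALLOWED_JOB_STATUSES =
            EnumSet.of(JobStatus.RUNNING, JobStatus.SUSPENDED);

    // Returns true when an action check may proceed: the action must still
    // be RUNNING and the owning workflow must be RUNNING or SUSPENDED.
    static boolean mayCheckAction(ActionStatus action, JobStatus job) {
        return action == ActionStatus.RUNNING
                && ALLOWED_JOB_STATUSES.contains(job);
    }

    public static void main(String[] args) {
        // A callback arriving while the workflow is suspended is now accepted.
        System.out.println(
                mayCheckAction(ActionStatus.RUNNING, JobStatus.SUSPENDED));
        // A killed workflow is still rejected.
        System.out.println(
                mayCheckAction(ActionStatus.RUNNING, JobStatus.KILLED));
    }
}
```

With this shape, a callback that arrives while the workflow is SUSPENDED can
update the action right away instead of being rejected with E0818.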
> If the underlying job finishes while a Workflow is suspended, Oozie can take
> a while to realize it
> --------------------------------------------------------------------------------------------------
>
> Key: OOZIE-1849
> URL: https://issues.apache.org/jira/browse/OOZIE-1849
> Project: Oozie
> Issue Type: Improvement
> Components: core
> Affects Versions: 4.0.1
> Reporter: Robert Kanter
> Assignee: Robert Kanter
> Attachments: OOZIE-1849.patch
>
>
> Suppose you have a Workflow and you suspend it while one of the actions is
> still RUNNING. The underlying MR/Pig/etc job will continue running (as
> expected, because we can't pause those). However, if that job finishes while
> the workflow is SUSPENDED, the CallbackServlet will receive the callback, but
> the ActionCheckXCommand won't update the action:
> {noformat}
> 2014-05-16 17:40:57,959 INFO CallbackServlet:541 - SERVER[rkanter-mbp.local]
> USER[-] GROUP[-] TOKEN[-] APP[-] JOB[0000002-140516173529928-oozie-rkan-W]
> ACTION[0000002-140516173529928-oozie-rkan-W@mr-node] callback for action
> [0000002-140516173529928-oozie-rkan-W@mr-node]
> 2014-05-16 17:40:57,985 WARN ActionCheckXCommand:544 -
> SERVER[rkanter-mbp.local] USER[rkanter] GROUP[-] TOKEN[] APP[map-reduce-wf]
> JOB[0000002-140516173529928-oozie-rkan-W]
> ACTION[0000002-140516173529928-oozie-rkan-W@mr-node] E0818: Action
> [0000002-140516173529928-oozie-rkan-W@mr-node] status is running but WF Job
> [0000002-140516173529928-oozie-rkan-W] status is [SUSPENDED]. Expected status
> is RUNNING., Error Code: E0818
> {noformat}
> If you then resume the workflow, the action will stay RUNNING for up to 10
> minutes (the default fallback polling interval), at which point the
> ActionCheckerService will run an ActionCheckXCommand that will pass, check
> the job, and finally mark the action as SUCCESSFUL.
> We should fix this by one of the following:
> # ResumeXCommand should also queue an ActionCheckXCommand (if the workflow
> was SUSPENDED) so we don't have to wait for the ActionCheckerService
> # ActionCheckXCommand's precondition check should allow SUSPENDED workflows