[
https://issues.apache.org/jira/browse/OOZIE-2985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16218844#comment-16218844
]
Peter Bacsko commented on OOZIE-2985:
-------------------------------------
[~dionusos] that's certainly possible and could help to a certain extent.
With the callback mechanism, we can avoid some initial polling after submitting
the application, so we just wait until we get a response.
I can imagine creating a new state: when the application is successfully
submitted, it goes into SUBMITTED state and not RUNNING, and when we receive
the initial callback from LauncherAM it's updated to RUNNING. An action that
stays in SUBMITTED could then also indicate that something is wrong.
But again, the problem is that we can't wait indefinitely for the callback, so
we have to let it go after a while. Not to mention that it also adds further
complexity to the execution model. I'm not sure it's worth it.
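
To make the idea concrete, here is a minimal sketch of the state model
described above. Everything in it is hypothetical: LauncherStateTracker,
onInitialCallback(), needsStatusCheck() and the timeout parameter are made-up
names for illustration and are not part of Oozie. It only shows the two
transitions (SUBMITTED on submit, RUNNING on the first callback) and the
bounded wait after which we stop relying on the callback.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical sketch: none of these names exist in Oozie.
final class LauncherStateTracker {
    enum State { SUBMITTED, RUNNING }

    private final Instant submittedAt;
    // Assumed upper bound on how long we are willing to wait for the callback.
    private final Duration callbackTimeout;
    // A freshly submitted application starts in SUBMITTED, not RUNNING.
    private State state = State.SUBMITTED;

    LauncherStateTracker(Instant submittedAt, Duration callbackTimeout) {
        this.submittedAt = submittedAt;
        this.callbackTimeout = callbackTimeout;
    }

    synchronized State state() {
        return state;
    }

    // Invoked when the initial callback from the launcher arrives.
    synchronized void onInitialCallback() {
        state = State.RUNNING;
    }

    // True once we have waited longer than the timeout without a callback,
    // i.e. the caller should stop waiting and check the application itself.
    synchronized boolean needsStatusCheck(Instant now) {
        return state == State.SUBMITTED
                && Duration.between(submittedAt, now).compareTo(callbackTimeout) > 0;
    }
}
```

Even in this reduced form there is an extra state plus a timeout to tune,
which is the added complexity I mean.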
> If LauncherAM fails, Oozie is not notified in a timely manner
> -------------------------------------------------------------
>
> Key: OOZIE-2985
> URL: https://issues.apache.org/jira/browse/OOZIE-2985
> Project: Oozie
> Issue Type: Bug
> Reporter: Attila Sasvari
>
> I've noticed if LauncherAM fails, Oozie is notified about the launcher's
> failure with a lot of delay. It gives the impression that the workflow is
> running.
> {{oozie job -oozie http://localhost:11000/oozie -config
> examples/apps/datelist-java-main/job.properties -info
> 0000000-170712153835057-oozie-asas-W}}
> {code}
> 0000000-170712153835057-oozie-asas-W@java1 RUNNING application_1499866588585_0001 RUNNING -
> {code}
> I've looked at the YARN logs for the launcher and seen that the launcher
> failed. For example, in my case, during development, the oozie-sharelib
> launcher was not found:
> {code}
> Error: Could not find or load main class
> org.apache.oozie.action.hadoop.LauncherAM
> {code}
> The problem is that only after the specified timeout (10 minutes by default)
> do we see that the workflow has actually failed/errored.
> {code}
> Created : 2017-07-12 13:38 GMT
> Started : 2017-07-12 13:38 GMT
> Last Modified : 2017-07-12 13:49 GMT
> ...
> 0000000-170712153835057-oozie-asas-W@java1 ERROR application_1499866588585_0001 FAILED/KILLED -
> {code}
> The problem might be that in the {{start()}} method of {{JavaActionExecutor}}
> the check runs too soon after the launcher is submitted.
> {code}
> LOG.debug("Starting action " + action.getId() + " getting Action File System");
> FileSystem actionFs = context.getAppFileSystem();
> LOG.debug("Preparing action Dir through copying " + context.getActionDir());
> prepareActionDir(actionFs, context);
> LOG.debug("Action Dir is ready. Submitting the action ");
> submitLauncher(actionFs, context, action);
> LOG.debug("Action submit completed. Performing check ");
> check(context, action);
> LOG.debug("Action check is done after submission");
> {code}
> There should be some delay after {{submitLauncher()}} before {{check()}}.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)