[
https://issues.apache.org/jira/browse/SAMZA-65?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13979934#comment-13979934
]
Zhijie Shen commented on SAMZA-65:
----------------------------------
bq. If the AM calls unregisterApplicationMaster to set the diagnostic message,
and it sets the status to FAILED, will YARN retry the AM again?
It won't. Once the AM calls unregisterApplicationMaster, YARN considers the
application to have finished normally. No matter how many attempts remain, YARN
won't restart the AM.
bq. Is there a way to call unregisterApplicationMaster but allow YARN to
continue enforcing its yarn.resourcemanager.am.max-attempts behavior?
That's arguable. Currently, I'd say there isn't. If the AM calls
unregisterApplicationMaster, YARN assumes the AM intentionally wants to
finish the application. The retry attempts are for the case where the AM
crashes unexpectedly.
I think what is unfriendly here is that no matter what kind of crash
happens, the diagnostics message looks almost the same:
DefaultContainerExecutor composes the diagnostics message from the exception
stack instead of the exception message (the message would probably carry more
useful information).
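A minimal sketch of how an AM could build a more useful diagnostics string
from the exception it already has. The class and method names here
(DiagnosticMessage, fromThrowable) and the length cap are hypothetical and
are not part of Samza or YARN. The resulting string could be passed as the
appMessage argument of AMRMClient.unregisterApplicationMaster, keeping in
mind that, as described above, doing so ends any further AM retries.

```java
import java.io.PrintWriter;
import java.io.StringWriter;

/** Hypothetical helper for illustration; not part of Samza or YARN. */
final class DiagnosticMessage {
    private DiagnosticMessage() {}

    /**
     * Renders the throwable the same way printStackTrace does (class name,
     * exception message, then stack frames), truncated to at most maxChars.
     */
    static String fromThrowable(Throwable error, int maxChars) {
        StringWriter buffer = new StringWriter();
        PrintWriter out = new PrintWriter(buffer);
        error.printStackTrace(out);
        out.flush();
        String text = buffer.toString();
        int limit = Math.max(0, maxChars); // guard against a negative cap
        return text.length() <= limit ? text : text.substring(0, limit);
    }
}
```

Because the first line of the output is the exception class plus its
message, the diagnostics would start with the actual cause rather than a
generic Shell$ExitCodeException trace.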
> Samza should use YARN's setDiagnosticMessage command when failures occur
> ------------------------------------------------------------------------
>
> Key: SAMZA-65
> URL: https://issues.apache.org/jira/browse/SAMZA-65
> Project: Samza
> Issue Type: Bug
> Components: yarn
> Affects Versions: 0.6.0
> Reporter: Chris Riccomini
>
> Currently, when an AM container fails, the diagnostic message reads:
> {noformat}
> Diagnostics:
> Application application_1382474502616_0004 failed 2 times due to AM Container
> for appattempt_1382474502616_0004_000002 exited with exitCode: 1 due to:
> Exception from container-launch:
> org.apache.hadoop.util.Shell$ExitCodeException:
> at org.apache.hadoop.util.Shell.runCommand(Shell.java:464)
> at org.apache.hadoop.util.Shell.run(Shell.java:379)
> at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:589)
> at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:195)
> at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:283)
> at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:79)
> at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> at java.util.concurrent.FutureTask.run(FutureTask.java:138)
> at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> at java.lang.Thread.run(Thread.java:619)
> .Failing this attempt.. Failing the application.
> {noformat}
> Users then generally click through to the AM logs to see the stderr message.
> Samza actually knows what exception is being thrown, which triggers the
> non-zero exit code. It should set a better diagnostic with the actual stack
> trace.
> This change should definitely be made for the Samza AM.
> I'm not sure how to best handle this with SamzaContainer, since it is
> job-type agnostic, and doesn't know anything about YARN. For now, I think it's
> best to only do the AM.