[ 
https://issues.apache.org/jira/browse/SAMZA-65?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13979934#comment-13979934
 ] 

Zhijie Shen commented on SAMZA-65:
----------------------------------

bq. If the AM calls unregisterApplicationMaster to set the diagnostic message, 
and it sets the status to FAILED, will YARN retry the AM again?

It won't. Once the AM calls unregisterApplicationMaster, YARN considers the 
application to have finished normally. No matter how many attempts remain, YARN 
won't restart the AM.

bq. Is there a way to call unregisterApplicationMaster but allow YARN to 
continue enforcing its yarn.resourcemanager.am.max-attempts behavior?

It's arguable. Currently, I'd say there isn't. If the AM calls 
unregisterApplicationMaster, YARN assumes the AM intentionally wants to finish 
the application. The retry attempts are for the case where the AM crashes 
unexpectedly.

I think what is unfriendly here is that no matter what kind of crash happens, 
the diagnostics message looks almost the same: DefaultContainerExecutor 
composes the diagnostics message from the exception stack trace instead of the 
exception message (which would probably have more useful information).
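
To illustrate the difference, here is a minimal sketch of composing a diagnostic string that leads with the exception message and only then appends the stack trace. The class and method names (DiagnosticsSketch, compose) are hypothetical and are not taken from Samza or YARN. In an AM, a string like this could be passed as the message argument of AMRMClient.unregisterApplicationMaster(FinalApplicationStatus, String, String), keeping in mind that unregistering ends the application without further retries.

```java
import java.io.PrintWriter;
import java.io.StringWriter;

public class DiagnosticsSketch {
    /**
     * Hypothetical helper: builds a diagnostic string whose first line is the
     * exception message (or the exception class name when there is no message),
     * followed by the full stack trace.
     */
    public static String compose(Throwable t) {
        String msg = t.getMessage();
        String summary = (msg == null || msg.isEmpty()) ? t.getClass().getName() : msg;

        StringWriter out = new StringWriter();
        PrintWriter writer = new PrintWriter(out);
        writer.println("Failure: " + summary);
        t.printStackTrace(writer);
        writer.flush();
        return out.toString();
    }

    public static void main(String[] args) {
        try {
            // Stand-in for whatever actually fails inside the AM.
            throw new IllegalStateException("could not read job config");
        } catch (IllegalStateException e) {
            // Assumption: in a real AM this text would be handed to YARN as the
            // diagnostic message when unregistering with a FAILED status.
            System.out.println(compose(e));
        }
    }
}
```

With this shape, the first line of the diagnostics already tells the user what went wrong, and the stack trace remains available underneath for debugging.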

> Samza should use YARN's setDiagnosticMessage command when failures occur
> ------------------------------------------------------------------------
>
>                 Key: SAMZA-65
>                 URL: https://issues.apache.org/jira/browse/SAMZA-65
>             Project: Samza
>          Issue Type: Bug
>          Components: yarn
>    Affects Versions: 0.6.0
>            Reporter: Chris Riccomini
>
> Currently, when an AM container fails, the diagnostic message reads:
> {noformat}
> Diagnostics:  
> Application application_1382474502616_0004 failed 2 times due to AM Container 
> for appattempt_1382474502616_0004_000002 exited with exitCode: 1 due to: 
> Exception from container-launch:
> org.apache.hadoop.util.Shell$ExitCodeException:
> at org.apache.hadoop.util.Shell.runCommand(Shell.java:464)
> at org.apache.hadoop.util.Shell.run(Shell.java:379)
> at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:589)
> at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:195)
> at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:283)
> at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:79)
> at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> at java.util.concurrent.FutureTask.run(FutureTask.java:138)
> at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> at java.lang.Thread.run(Thread.java:619)
> .Failing this attempt.. Failing the application.
> {noformat}
> Users then generally click through to the AM logs to see the stderr message.
> Samza actually knows what exception is being thrown, which triggers the 
> non-zero exit code. It should set a better diagnostic with the actual stack 
> trace.
> This change should definitely be made for the Samza AM.
> I'm not sure how best to handle this with SamzaContainer, since it is 
> job-type agnostic, and doesn't know anything about YARN. For now, I think it's 
> best to only do the AM.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
