[ 
https://issues.apache.org/jira/browse/SAMZA-65?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968168#comment-13968168
 ] 

Zhijie Shen commented on SAMZA-65:
----------------------------------

[~criccomini], if I understand correctly, the posted diagnostics come from 
the application rather than from a particular AM container.

You saw this message because the AM crashed twice due to some exceptions. 
Once YARN saw that your application had no retries left, it failed the 
application and recorded these diagnostics. Unfortunately, AFAIK, the AM has 
no channel other than unregisterApplicationMaster for reporting diagnostics 
to YARN.

However, unregisterApplicationMaster is a different story. If we catch the 
exception in the AM and pass it as the diagnostics message in the 
unregisterApplicationMaster request, YARN will record those diagnostics. 
YARN will then consider the application FINISHED rather than FAILED, although 
the AM can still tell YARN that the FinalApplicationStatus is FAILED. This is 
still okay, but we need to be careful: once the AM calls 
unregisterApplicationMaster, YARN will not retry it, so a Samza job may not 
be able to survive some transient failures.
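
To make the trade-off concrete, here is a minimal sketch of one way the AM 
could decide between the two paths. It is not Samza's actual code. The 
Unregistrar interface is a hypothetical stand-in for the 
unregisterApplicationMaster request (its two-argument shape is an assumption, 
not the real YARN signature), and the isFatal predicate is an assumed policy 
hook for deciding which failures are not worth retrying.

```java
import java.io.PrintWriter;
import java.io.StringWriter;
import java.util.function.Predicate;

public class AmDiagnosticsSketch {

    /** Hypothetical stand-in for the unregisterApplicationMaster request. */
    public interface Unregistrar {
        void unregister(String finalStatus, String diagnostics);
    }

    /** Renders the full stack trace of a throwable as a string. */
    public static String stackTraceOf(Throwable t) {
        StringWriter sw = new StringWriter();
        t.printStackTrace(new PrintWriter(sw, true));
        return sw.toString();
    }

    /**
     * Runs the AM body. If it fails with an exception the (assumed) policy
     * treats as fatal, the stack trace is sent as diagnostics with a FAILED
     * final status, which gives up any remaining retries. Any other
     * exception is rethrown without unregistering, so the process can exit
     * non-zero and YARN can start another attempt.
     */
    public static boolean runAm(Runnable amBody,
                                Predicate<Throwable> isFatal,
                                Unregistrar yarn) {
        try {
            amBody.run();
            yarn.unregister("SUCCEEDED", "");
            return true;
        } catch (RuntimeException e) {
            if (isFatal.test(e)) {
                yarn.unregister("FAILED", stackTraceOf(e));
                return false;
            }
            throw e; // possibly transient: leave the retry to YARN
        }
    }
}
```

With a policy like this, only failures judged permanent would carry the real 
stack trace in the diagnostics; anything else would still surface the generic 
container-launch message after the last attempt.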

> Samza should use YARN's setDiagnosticMessage command when failures occur
> ------------------------------------------------------------------------
>
>                 Key: SAMZA-65
>                 URL: https://issues.apache.org/jira/browse/SAMZA-65
>             Project: Samza
>          Issue Type: Bug
>          Components: yarn
>    Affects Versions: 0.6.0
>            Reporter: Chris Riccomini
>
> Currently, when an AM container fails, the diagnostic message reads:
> {noformat}
> Diagnostics:  
> Application application_1382474502616_0004 failed 2 times due to AM Container 
> for appattempt_1382474502616_0004_000002 exited with exitCode: 1 due to: 
> Exception from container-launch:
> org.apache.hadoop.util.Shell$ExitCodeException:
> at org.apache.hadoop.util.Shell.runCommand(Shell.java:464)
> at org.apache.hadoop.util.Shell.run(Shell.java:379)
> at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:589)
> at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:195)
> at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:283)
> at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:79)
> at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> at java.util.concurrent.FutureTask.run(FutureTask.java:138)
> at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> at java.lang.Thread.run(Thread.java:619)
> .Failing this attempt.. Failing the application.
> {noformat}
> Users then generally click through to the AM logs to see the stderr message.
> Samza actually knows what exception is being thrown, which triggers the 
> non-zero exit code. It should set a better diagnostic with the actual stack 
> trace.
> This change should definitely be made for the Samza AM.
> I'm not sure how to best handle this with SamzaContainer, since it is 
> job-type agnostic, and doesn't know anything about YARN. For now, I think it's 
> best to only do the AM.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
