[
https://issues.apache.org/jira/browse/SAMZA-65?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13983224#comment-13983224
]
Chris Riccomini commented on SAMZA-65:
--------------------------------------
Hmm. Yeah, this is problematic. The trick is that the AM can only report the
exception on the unregister call, and it should only unregister on its LAST
failure, but it doesn't (easily) know which failure is its last, since the
max-attempts setting is configurable at the yarn-site.xml level.
Maybe we should just close this as won't fix for now? It seems like the YARN
API itself might need to be improved here, though I'm not quite sure of the
best way. Perhaps it could offer a way to supply an exception without shutting
the entire job down (i.e. something like, "I've failed, here's the exception,
display the exception in the RM UI, but follow your normal max-attempts logic.").
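To make the constraint concrete, here is a minimal sketch of the decision the AM would have to make on failure. It is not Samza or YARN code: the class, enum, and method names are all hypothetical, and it assumes attempt numbers are 1-based (as the appattempt_..._000002 ID in the description suggests) and that the max-attempts value has somehow been made available to the AM, which is exactly the part that is hard today.

```java
// Hypothetical helper for illustration only; none of these names exist in Samza or YARN.
final class AttemptPolicy {

    /** What the AM should do after an unrecoverable exception. */
    enum Action {
        /** Exit non-zero without unregistering, so YARN can start another attempt. */
        EXIT_FOR_RETRY,
        /** Unregister and attach the exception, since no further attempt will follow. */
        UNREGISTER_WITH_DIAGNOSTICS
    }

    /**
     * @param attemptNumber 1-based number of the current attempt (assumption)
     * @param maxAttempts   cluster-wide limit from yarn-site.xml; the caller must
     *                      obtain it somehow, which is the open problem here
     */
    static Action onFailure(int attemptNumber, int maxAttempts) {
        if (attemptNumber < 1 || maxAttempts < 1) {
            throw new IllegalArgumentException("attemptNumber and maxAttempts must be >= 1");
        }
        // Only the final attempt may unregister; earlier ones must leave retries intact.
        return attemptNumber >= maxAttempts
                ? Action.UNREGISTER_WITH_DIAGNOSTICS
                : Action.EXIT_FOR_RETRY;
    }
}
```

With a limit of 2, AttemptPolicy.onFailure(1, 2) yields EXIT_FOR_RETRY and AttemptPolicy.onFailure(2, 2) yields UNREGISTER_WITH_DIAGNOSTICS. If the AM guesses maxAttempts too low, it unregisters early and gives up retries; if it guesses too high, the exception is never reported.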
> Samza should use YARN's setDiagnosticMessage command when failures occur
> ------------------------------------------------------------------------
>
> Key: SAMZA-65
> URL: https://issues.apache.org/jira/browse/SAMZA-65
> Project: Samza
> Issue Type: Bug
> Components: yarn
> Affects Versions: 0.6.0
> Reporter: Chris Riccomini
>
> Currently, when an AM container fails, the diagnostic message reads:
> {noformat}
> Diagnostics:
> Application application_1382474502616_0004 failed 2 times due to AM Container
> for appattempt_1382474502616_0004_000002 exited with exitCode: 1 due to:
> Exception from container-launch:
> org.apache.hadoop.util.Shell$ExitCodeException:
> at org.apache.hadoop.util.Shell.runCommand(Shell.java:464)
> at org.apache.hadoop.util.Shell.run(Shell.java:379)
> at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:589)
> at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:195)
> at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:283)
> at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:79)
> at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> at java.util.concurrent.FutureTask.run(FutureTask.java:138)
> at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> at java.lang.Thread.run(Thread.java:619)
> .Failing this attempt.. Failing the application.
> {noformat}
> Users then generally click through to the AM logs to see the stderr message.
> Samza actually knows what exception is being thrown, which triggers the
> non-zero exit code. It should set a better diagnostic with the actual stack
> trace.
> This change should definitely be made for the Samza AM.
> I'm not sure how to best handle this with SamzaContainer, since it is
> job-type agnostic, and doesn't know anything about YARN. For now, I think it's
> best to only do the AM.
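
As a rough illustration of "a better diagnostic with the actual stack trace", the sketch below turns the exception the AM already holds into a bounded string that could be passed along as the diagnostic message. The class name, method name, and the maxChars cap are assumptions made for this example, not part of Samza or YARN.

```java
import java.io.PrintWriter;
import java.io.StringWriter;

// Hypothetical helper for illustration only; not part of Samza or YARN.
final class DiagnosticsFormatter {

    /**
     * Renders the throwable's full stack trace (including causes) as text and
     * truncates it to at most maxChars characters. The cap is an assumption:
     * it keeps an unusually deep trace from producing an oversized message.
     */
    static String format(Throwable failure, int maxChars) {
        if (maxChars < 0) {
            throw new IllegalArgumentException("maxChars must be >= 0");
        }
        StringWriter buffer = new StringWriter();
        try (PrintWriter out = new PrintWriter(buffer)) {
            failure.printStackTrace(out);
        }
        String trace = buffer.toString();
        return trace.length() <= maxChars ? trace : trace.substring(0, maxChars);
    }
}
```

For example, DiagnosticsFormatter.format(e, 4096) in the AM's top-level catch block would produce text that starts with the exception class and message, which is the part users currently have to dig out of the AM's stderr log.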