[jira] [Commented] (TEZ-3462) Task attempt failure during container shutdown loses useful container diagnostics

Jason Lowe (JIRA) Wed, 19 Oct 2016 06:47:41 -0700

    [ https://issues.apache.org/jira/browse/TEZ-3462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15588813#comment-15588813 ]


Jason Lowe commented on TEZ-3462:
---------------------------------

bq. Are there other scenarios which will cause the ShutdownHook to be invoked 
(kill task from the AM, shutdown container from the AM, anything from within 
the task)? Will we end up losing potential diagnostic information in these 
cases?

It's important to note that we shouldn't be completely losing the diagnostic 
information in these cases.  The information should still be emitted to the 
logs.  The main difference is that we won't advertise failures that occur 
during shutdown as the final status of the task.  The assumption here is that 
if the JVM is shutting down, it's because of a more relevant event, and that 
event should be used as the diagnostics for the task.  Failures during 
shutdown are very likely a side effect of the shutdown rather than the cause 
of the task going down.
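
To make the idea concrete, here is a minimal sketch of the behavior described 
above.  It is not the code from the attached patch: the class and method names 
(ShutdownAwareReporter, markShutdownStarted, reportFailure) are hypothetical, 
and printing to stderr stands in for the real task logging.  A shutdown hook 
flips a flag; failures seen after that are still logged but are not recorded 
as the task's final diagnostics.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical sketch only; these names do not come from the Tez codebase. */
public class ShutdownAwareReporter {
    // Flipped to true once JVM shutdown has been observed.
    private final AtomicBoolean shuttingDown = new AtomicBoolean(false);

    // Diagnostics that would be advertised as the task's final status.
    private volatile String finalDiagnostics = null;

    /** Registers a JVM shutdown hook that marks shutdown as in progress. */
    public void install() {
        Runtime.getRuntime().addShutdownHook(new Thread(this::markShutdownStarted));
    }

    /** Public so the behavior can be exercised without stopping the JVM. */
    public void markShutdownStarted() {
        shuttingDown.set(true);
    }

    /**
     * Always logs the failure. Records it as the final diagnostics only if
     * the JVM is not shutting down. Returns true if it was recorded.
     */
    public boolean reportFailure(Throwable t) {
        System.err.println("Task failure: " + t); // stand-in for real logging
        if (shuttingDown.get()) {
            // Likely a side effect of the shutdown, so do not advertise it.
            return false;
        }
        finalDiagnostics = t.toString();
        return true;
    }

    public String getFinalDiagnostics() {
        return finalDiagnostics;
    }
}
```

One caveat with this sketch: plain JVM shutdown hooks run concurrently and in 
no guaranteed order, so the flag is not guaranteed to be set before another 
hook (for example, one that closes the filesystem) causes the task error.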

If we want to do this by publishing a separate fixup ATS event, I'm OK with 
that too.  The key point here is that the user shouldn't be shown misleading 
diagnostics from errors caused by a JVM shutdown, as that just sends them on 
wild goose chases.  Part of the challenge with the post-fixup event approach 
is knowing when it's appropriate for the container's diagnostics to override 
the task's diagnostics.

> Task attempt failure during container shutdown loses useful container 
> diagnostics
> ---------------------------------------------------------------------------------
>
>                 Key: TEZ-3462
>                 URL: https://issues.apache.org/jira/browse/TEZ-3462
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.7.1
>            Reporter: Jason Lowe
>            Assignee: Eric Badger
>         Attachments: TEZ-3462.001.patch
>
>
> When a nodemanager kills a task attempt due to excessive memory usage, it will 
> send a SIGTERM followed by a SIGKILL.  It also sends a useful diagnostic 
> message with the container completion event to the RM which will eventually 
> make it to the AM on a subsequent heartbeat.
> However, if the JVM shutdown processing causes an error in the task (e.g., 
> filesystem being closed by shutdown hook), then the task attempt can report a 
> failure before the useful NM diagnostic makes it to the AM.  The AM then 
> records some other error as the task failure reason, and by the time the 
> container completion status makes it to the AM, it does not associate that 
> error with the task attempt and the useful information is lost.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
