[ 
https://issues.apache.org/jira/browse/TEZ-3462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15555638#comment-15555638
 ] 

Siddharth Seth commented on TEZ-3462:
-------------------------------------

Are there other scenarios that will cause the ShutdownHook to be invoked (kill 
task from the AM, shutdown container from the AM, anything from within the 
task)? Will we end up losing potential diagnostic information in these cases? 
For AM-invoked actions, diagnostics are hopefully handled correctly. If any 
such actions are invoked by the task itself, their diagnostics will be lost, 
and the task timeout will end up getting triggered.
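
To make the concern concrete, here is a rough sketch of a task-side filter that suppresses failure reports once JVM shutdown has begun, unless the task itself initiated the exit. All names here (ShutdownAwareReporter, noteSelfInitiatedExit, diagnosticToReport) are hypothetical and are not taken from Tez or from the attached patch; only Runtime.addShutdownHook is a real JDK call.

```java
import java.util.Optional;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch only: none of these names come from Tez.
final class ShutdownAwareReporter {
    private final AtomicBoolean shutdownStarted = new AtomicBoolean(false);
    private final AtomicReference<String> selfInitiatedReason = new AtomicReference<>();

    /** Registers a JVM shutdown hook that records that shutdown has begun. */
    void install() {
        Runtime.getRuntime().addShutdownHook(
            new Thread(this::onShutdownStarted, "shutdown-flag"));
    }

    /** Called by the shutdown hook (or directly, in tests). */
    void onShutdownStarted() {
        shutdownStarted.set(true);
    }

    /** The task calls this before exiting on its own, so the reason is kept. */
    void noteSelfInitiatedExit(String reason) {
        selfInitiatedReason.set(reason);
    }

    /**
     * Returns the diagnostic to send, or empty when the failure happened
     * during an externally triggered shutdown and should be left to the
     * container completion status.
     */
    Optional<String> diagnosticToReport(Throwable failure) {
        if (!shutdownStarted.get()) {
            return Optional.of("Task failed: " + failure.getMessage());
        }
        String reason = selfInitiatedReason.get();
        if (reason != null) {
            return Optional.of("Task-initiated shutdown: " + reason);
        }
        return Optional.empty();
    }
}
```

Without something like the self-initiated branch, a failure raised during a task-triggered shutdown would be dropped, and nothing would be reported until the task timeout fires.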

Another approach would be to add this diagnostic information after it is 
received by the AM (from the RM). That's a little more complicated to handle, 
since the ATS publish would already have happened. There's a parallel ask to 
update counters after a task completes (information like time to kill task).

> Task attempt failure during container shutdown loses useful container 
> diagnostics
> ---------------------------------------------------------------------------------
>
>                 Key: TEZ-3462
>                 URL: https://issues.apache.org/jira/browse/TEZ-3462
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.7.1
>            Reporter: Jason Lowe
>            Assignee: Eric Badger
>         Attachments: TEZ-3462.001.patch
>
>
> When a nodemanager kills a task attempt due to excessive memory usage it will 
> send a SIGTERM followed by a SIGKILL.  It also sends a useful diagnostic 
> message with the container completion event to the RM which will eventually 
> make it to the AM on a subsequent heartbeat.
> However, if the JVM shutdown processing causes an error in the task (e.g., 
> the filesystem being closed by a shutdown hook), then the task attempt can 
> report a failure before the useful NM diagnostic makes it to the AM.  The AM 
> then records some other error as the task failure reason, and by the time 
> the container completion status makes it to the AM, the AM does not 
> associate that diagnostic with the task attempt and the useful information 
> is lost.



