[jira] [Commented] (TEZ-2311) AM can hang if kill received while recovering from previous attempt

Hitesh Shah (JIRA) Wed, 29 Jul 2015 18:12:20 -0700

    [ 
https://issues.apache.org/jira/browse/TEZ-2311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14647013#comment-14647013
 ]


Hitesh Shah commented on TEZ-2311:
----------------------------------

bq. DAGKillRequestEvent means the kill action is triggered, but doesn't mean 
the kill is completed. It is possible we only see DAGKillRequestEvent but no 
DAGFinishedEvent

Correct, but that may be fine, I think, as we can treat the DAG as KILLED. What 
kind of problems do you see if we treat a DAGKillRequestEvent summary event 
the same as a DAGFinished SummaryEvent with state KILLED?
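
To make the question concrete, here is a minimal sketch of the rule being 
proposed. It is not the Tez recovery code: the class, enum, and method names 
are all hypothetical, and it assumes recovery scans summary events in order 
and that an explicit finished event, if present, takes precedence over a kill 
request.

```java
import java.util.Arrays;
import java.util.List;

public class RecoverySketch {
    // Hypothetical stand-ins for the summary event kinds discussed above.
    enum SummaryEventType { DAG_SUBMITTED, DAG_KILL_REQUEST, DAG_FINISHED }

    enum DagState { RUNNING, SUCCEEDED, FAILED, KILLED }

    static final class SummaryEvent {
        final SummaryEventType type;
        final DagState finalState; // only meaningful for DAG_FINISHED

        SummaryEvent(SummaryEventType type, DagState finalState) {
            this.type = type;
            this.finalState = finalState;
        }
    }

    // Assumed rule: a finished event decides the state; otherwise a kill
    // request on its own is treated as if the DAG had finished as KILLED.
    static DagState recoveredState(List<SummaryEvent> events) {
        boolean killRequested = false;
        for (SummaryEvent e : events) {
            if (e.type == SummaryEventType.DAG_FINISHED) {
                return e.finalState;
            }
            if (e.type == SummaryEventType.DAG_KILL_REQUEST) {
                killRequested = true;
            }
        }
        return killRequested ? DagState.KILLED : DagState.RUNNING;
    }

    public static void main(String[] args) {
        // Kill was requested, but the previous attempt died before it
        // could log a finished event.
        List<SummaryEvent> log = Arrays.asList(
            new SummaryEvent(SummaryEventType.DAG_SUBMITTED, null),
            new SummaryEvent(SummaryEventType.DAG_KILL_REQUEST, null));
        System.out.println(recoveredState(log));
    }
}
```

Under this rule the new attempt would not need to resume a DAG for which a 
kill had already been requested; it could report it as KILLED directly.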

bq. Even if this code block is put after the DAG recovery code, the AM may 
still shut down before recovery completes, because the whole recovery is 
handled in the dispatcher thread. But putting the shutdown block after the 
recovery code would be better, since we would have more chance to recover the 
DAG before the AM shuts down.

I think either approach is fine. The question is which one leaves better 
state for the end user, i.e. YARN app diagnostics and data in timeline.


> AM can hang if kill received while recovering from previous attempt
> -------------------------------------------------------------------
>
>                 Key: TEZ-2311
>                 URL: https://issues.apache.org/jira/browse/TEZ-2311
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.6.0
>            Reporter: Jason Lowe
>            Assignee: Jeff Zhang
>              Labels: Recovery
>         Attachments: TEZ-2311-1.patch, TEZ-2311-2.patch, TEZ-2311-3.patch, 
> TEZ-2311-4.patch, TEZ-2311-5.patch
>
>
> We saw an instance of a Tez job hanging despite receiving multiple kill 
> requests from clients.  The AM was recovering from a prior attempt when the 
> first kill request arrived.



