[jira] [Commented] (TEZ-2311) AM can hang if kill received while recovering from previous attempt

Jeff Zhang (JIRA) Wed, 29 Jul 2015 18:01:25 -0700

    [ 
https://issues.apache.org/jira/browse/TEZ-2311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14647001#comment-14647001
 ]


Jeff Zhang commented on TEZ-2311:
---------------------------------

bq. shouldn't this code block be after the dag is recovered to killed state? 
The current approach will leave the dag in an unfinished state?
Even if this code block is placed after the DAG recovery code, the AM may still 
shut down before recovery completes, because the whole recovery is handled in 
the dispatcher thread. Still, putting the shutdown block after the recovery 
code would be better, since it gives us a better chance of recovering the DAG 
before the AM shuts down. 
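
To illustrate the ordering concern, here is a minimal sketch and not the Tez 
code: a single-threaded executor stands in for the dispatcher thread, and all 
names (DispatcherOrderingSketch, recoveredBeforeShutdown, waitForRecovery) are 
hypothetical. It only shows that code placed after an asynchronous hand-off can 
run before the handed-off work finishes unless it explicitly waits.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch; none of these names come from the Tez code base.
final class DispatcherOrderingSketch {

    // Returns whether "recovery" had finished at the moment "shutdown" began.
    static boolean recoveredBeforeShutdown(boolean waitForRecovery) {
        // Assumption: one thread processes events in order, like a dispatcher.
        ExecutorService dispatcher = Executors.newSingleThreadExecutor();
        AtomicBoolean recovered = new AtomicBoolean(false);
        // Stands in for recovery taking a while on the dispatcher thread.
        CountDownLatch recoveryMayFinish = new CountDownLatch(1);
        try {
            // Recovery is only handed off here; the caller does not block.
            Future<?> recovery = dispatcher.submit(() -> {
                recoveryMayFinish.await();
                recovered.set(true);
                return null;
            });
            if (waitForRecovery) {
                recoveryMayFinish.countDown();
                recovery.get(); // block until the dispatcher finished recovery
            }
            // The shutdown block would start here.
            return recovered.get();
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            // Always let the pending task finish and stop the dispatcher.
            recoveryMayFinish.countDown();
            dispatcher.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(recoveredBeforeShutdown(false)); // false
        System.out.println(recoveredBeforeShutdown(true));  // true
    }
}
```

In this sketch, placing the shutdown after the hand-off is not enough by 
itself; only the explicit wait guarantees that recovery has finished first.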

bq. I would assume that the "killRequested" path should never be invoked - 
instead it should just be invoking dag recover in a completed state with 
state = KILLED? i.e. recoveredDAGData.isCompleted should become true and state 
KILLED if a dagKillRequested event is seen ( similar to seeing a 
DAGFinishedEvent in summary only with state = killed ). Or am I missing 
something?
DAGKillRequestEvent means the kill action has been triggered, but not that the 
kill has completed. It is possible to see only a DAGKillRequestEvent and no 
DAGFinishedEvent.
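
The distinction can be sketched as follows. This is an illustration under 
assumptions, not the actual recovery code: the enum values and the classify 
method are hypothetical stand-ins for whatever the summary log parsing does. 
The point is that a kill request without a finish event is a separate case 
from a completed DAG.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; these types are not Tez's recovery classes.
final class RecoveryStateSketch {

    enum SummaryEvent { DAG_SUBMITTED, DAG_KILL_REQUESTED, DAG_FINISHED }

    enum RecoveredState { RUNNING, KILL_REQUESTED, COMPLETED }

    // Classify a DAG from the summary events seen for it in a prior attempt.
    static RecoveredState classify(List<SummaryEvent> events) {
        boolean killRequested = false;
        for (SummaryEvent event : events) {
            if (event == SummaryEvent.DAG_FINISHED) {
                // A finish event means the DAG reached a final state.
                return RecoveredState.COMPLETED;
            }
            if (event == SummaryEvent.DAG_KILL_REQUESTED) {
                killRequested = true;
            }
        }
        // Kill was triggered but never observed to complete.
        return killRequested ? RecoveredState.KILL_REQUESTED
                             : RecoveredState.RUNNING;
    }

    public static void main(String[] args) {
        System.out.println(classify(Arrays.asList(
                SummaryEvent.DAG_SUBMITTED,
                SummaryEvent.DAG_KILL_REQUESTED))); // KILL_REQUESTED
        System.out.println(classify(Arrays.asList(
                SummaryEvent.DAG_SUBMITTED,
                SummaryEvent.DAG_KILL_REQUESTED,
                SummaryEvent.DAG_FINISHED)));       // COMPLETED
    }
}
```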

> AM can hang if kill received while recovering from previous attempt
> -------------------------------------------------------------------
>
>                 Key: TEZ-2311
>                 URL: https://issues.apache.org/jira/browse/TEZ-2311
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.6.0
>            Reporter: Jason Lowe
>            Assignee: Jeff Zhang
>              Labels: Recovery
>         Attachments: TEZ-2311-1.patch, TEZ-2311-2.patch, TEZ-2311-3.patch, 
> TEZ-2311-4.patch, TEZ-2311-5.patch
>
>
> We saw an instance of a Tez job hanging despite receiving multiple kill 
> requests from clients.  The AM was recovering from a prior attempt when the 
> first kill request arrived.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
