[jira] [Commented] (TEZ-2311) AM can hang if kill received while recovering from previous attempt

Jeff Zhang (JIRA) Tue, 28 Jul 2015 16:33:37 -0700

    [ 
https://issues.apache.org/jira/browse/TEZ-2311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14645195#comment-14645195
 ]


Jeff Zhang commented on TEZ-2311:
---------------------------------

bq. why is the dag kill requested being logged in DAGImpl and not in 
DAGAppMaster when the request was first received? i.e. as part of either 
sessionStop or the tryKillDAG call.
bq. The fix does not really solve the case where the user tried to stop the 
session but the AM crashed before the RM unregister happened and some recovery 
logs were not fully synced. Logging the event when the kill request was 
received by the initial handler could also add a flag as to whether a session 
stop was requested.
Right, it seems that logging the event in DAGAppMaster would be the only way to 
handle this case. 
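
As a rough illustration of that idea, the sketch below records the kill request at the point where it is first received, together with a flag saying whether a session stop was requested. Every name here (KillRequestLogSketch, KillRequestRecord, RecoveryLog, AppMasterHandler, appendAndSync) is hypothetical and is not part of the Tez code base; it only shows the shape of the approach, not the actual patch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch only: none of these types are real Tez classes.
final class KillRequestLogSketch {

    /** Assumed recovery record, written as soon as a kill request arrives. */
    static final class KillRequestRecord {
        final String dagId;
        final boolean sessionStopRequested;

        KillRequestRecord(String dagId, boolean sessionStopRequested) {
            this.dagId = dagId;
            this.sessionStopRequested = sessionStopRequested;
        }
    }

    /** In-memory stand-in for a recovery log; a real one would sync to durable storage. */
    static final class RecoveryLog {
        private final List<KillRequestRecord> records = new ArrayList<>();

        void appendAndSync(KillRequestRecord record) {
            records.add(record);
        }

        List<KillRequestRecord> records() {
            return records;
        }
    }

    /** Stand-in for the initial handler that first sees the request. */
    static final class AppMasterHandler {
        private final RecoveryLog log;

        AppMasterHandler(RecoveryLog log) {
            this.log = log;
        }

        /** Plain DAG kill: log first, then (not shown) forward the kill to the DAG. */
        void tryKillDag(String dagId) {
            log.appendAndSync(new KillRequestRecord(dagId, false));
        }

        /** Session stop: same record, with the session-stop flag set. */
        void sessionStop(String currentDagId) {
            log.appendAndSync(new KillRequestRecord(currentDagId, true));
        }
    }
}
```

Logging before the kill is dispatched means a later attempt can still see the request even if the AM crashes before the DAG itself processes it.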

bq. Any reason why the full process of enacting a dag kill is being implemented 
as compared to just using the same approach as recovering a dag finished 
summary event with killed state? Wouldnt that be simpler? There is probably no 
need to try and recover all kinds of information for a killed dag but simply a 
case of moving all objects to a final desired state? This obviously should only 
be done if the dag had not completed ( i.e not in a successful or failed state 
as DAG_KILL is ignored in a successful or failed state ).
The purpose is to reuse the code in 
KillNewJobTransition/KillInitedJobTransition/DAGKilledTransition, such as 
setTerminationCause. It could instead be done by recovering directly to the 
desired state. 
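
A minimal sketch of the "recover directly to the desired state" alternative follows, under the assumption that recovery knows the last recovered state and whether a kill request was logged. The names (KilledDagRecoverySketch, DagState, resolveRecoveredState) are hypothetical and are not Tez APIs; the only rule taken from the discussion above is that a kill is ignored once the DAG has already succeeded or failed.

```java
// Hypothetical sketch only: not the Tez state machine.
final class KilledDagRecoverySketch {

    /** Simplified, assumed set of DAG states. */
    enum DagState { NEW, INITED, RUNNING, SUCCEEDED, FAILED, KILLED }

    /** A kill is ignored once the DAG has succeeded or failed. */
    static boolean killIsIgnored(DagState state) {
        return state == DagState.SUCCEEDED || state == DagState.FAILED;
    }

    /**
     * Decide the state to recover into. If a kill was logged and the DAG had
     * not completed, jump straight to KILLED instead of replaying the kill
     * transitions; otherwise keep the recovered state.
     */
    static DagState resolveRecoveredState(DagState recovered, boolean killLogged) {
        if (killLogged && !killIsIgnored(recovered)) {
            return DagState.KILLED;
        }
        return recovered;
    }
}
```

The trade-off is the one raised above: this path is simpler, but anything the kill transitions normally set (for example the termination cause) would have to be set explicitly on the recovery path.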

> AM can hang if kill received while recovering from previous attempt
> -------------------------------------------------------------------
>
>                 Key: TEZ-2311
>                 URL: https://issues.apache.org/jira/browse/TEZ-2311
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.6.0
>            Reporter: Jason Lowe
>            Assignee: Jeff Zhang
>              Labels: Recovery
>         Attachments: TEZ-2311-1.patch, TEZ-2311-2.patch, TEZ-2311-3.patch
>
>
> We saw an instance of a Tez job hanging despite receiving multiple kill 
> requests from clients.  The AM was recovering from a prior attempt when the 
> first kill request arrived.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
