[jira] [Commented] (TEZ-2311) AM can hang if kill received while recovering from previous attempt

Hitesh Shah (JIRA) Tue, 28 Jul 2015 10:23:19 -0700

    [ 
https://issues.apache.org/jira/browse/TEZ-2311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14644672#comment-14644672
 ]


Hitesh Shah commented on TEZ-2311:
----------------------------------

Comments: 

   - Why is the DAG kill request being logged in DAGImpl and not in 
DAGAppMaster, where the request is first received, i.e. as part of either 
the sessionStop or the tryKillDAG call?
   - The fix does not really solve the case where the user tried to stop the 
session but the AM crashed before the RM unregister happened and some recovery 
logs were not fully synced. Logging the event when the kill request is 
received by the initial handler would also make it possible to record a flag 
indicating whether a session stop was requested. 
   - Is there any reason why the full process of enacting a DAG kill is being 
implemented, as compared to using the same approach as recovering a DAG 
finished summary event with a killed state? Wouldn't that be simpler? There is 
probably no need to try to recover all kinds of information for a killed DAG; 
it is simply a case of moving all objects to a final desired state. This 
obviously should only be done if the DAG had not completed (i.e. it is not in 
a successful or failed state, as DAG_KILL is ignored in a successful or failed 
state). 
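
To make the last suggestion concrete, here is a minimal sketch of the decision 
it describes. It is not taken from the patch or from the Tez code base: the 
class, enum, and method names are all hypothetical, and the real recovery path 
would have to move vertices and tasks to their final states as well.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical sketch only; none of these names are actual Tez classes.
final class KillRecoverySketch {

    // Simplified stand-in for the DAG states discussed in the comment.
    enum DagState { RUNNING, SUCCEEDED, FAILED, KILLED }

    // States in which a kill request is ignored, per the comment above.
    private static final Set<DagState> IGNORES_KILL =
        EnumSet.of(DagState.SUCCEEDED, DagState.FAILED);

    /**
     * Decide which state a recovered DAG should resume in.
     *
     * @param lastRecordedState last DAG state found in the recovery log
     * @param killWasLogged     whether a kill request was found in the recovery log
     * @return KILLED if a kill was logged and the DAG had not already completed,
     *         otherwise the last recorded state unchanged
     */
    static DagState resolveRecoveredState(DagState lastRecordedState,
                                          boolean killWasLogged) {
        if (killWasLogged && !IGNORES_KILL.contains(lastRecordedState)) {
            // Skip replaying the kill; go straight to the final state.
            return DagState.KILLED;
        }
        return lastRecordedState;
    }
}
```

With this shape, a DAG recorded as RUNNING with a logged kill resolves to 
KILLED, while one already recorded as SUCCEEDED or FAILED keeps its state.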



 

> AM can hang if kill received while recovering from previous attempt
> -------------------------------------------------------------------
>
>                 Key: TEZ-2311
>                 URL: https://issues.apache.org/jira/browse/TEZ-2311
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.6.0
>            Reporter: Jason Lowe
>            Assignee: Jeff Zhang
>              Labels: Recovery
>         Attachments: TEZ-2311-1.patch, TEZ-2311-2.patch, TEZ-2311-3.patch
>
>
> We saw an instance of a Tez job hanging despite receiving multiple kill 
> requests from clients.  The AM was recovering from a prior attempt when the 
> first kill request arrived.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
