[ 
https://issues.apache.org/jira/browse/TEZ-2311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14639496#comment-14639496
 ] 

Hitesh Shah commented on TEZ-2311:
----------------------------------

bq. Change on DAGImpl is due to one scenario (some vertices' recoveredState is 
KILLED, while others are still RUNNING). In that case, we need to kill the other 
RUNNING vertices in DAGImpl#VertexCompletedTransition.

I think the fix here should probably be the other way around. A single vertex 
recovered in KILLED state should not make the DAG start killing everything else. 
The better fix seems to be to write the DAG kill event to the recovery log when 
it is received. If the DAG kill does not finish before the AM crashes, then on 
recovery, process the recovery log and complete the kill as needed. 
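
To make the suggestion concrete, here is a rough sketch of the idea. It is 
illustrative only: every name in it (RecoveryKillSketch, EventType, Event, 
killPending, recover) is made up for this example and is not an existing Tez 
class or method, and the real recovery log format is not modeled. 

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch; none of these types are actual Tez classes. */
class RecoveryKillSketch {

    enum VertexState { RUNNING, SUCCEEDED, KILLED }

    enum EventType { VERTEX_FINISHED, DAG_KILL_REQUESTED, DAG_FINISHED }

    /** One record in the (hypothetical) recovery log. */
    static final class Event {
        final EventType type;
        final String vertex;     // null for DAG-level events
        final VertexState state; // null for DAG-level events

        private Event(EventType type, String vertex, VertexState state) {
            this.type = type;
            this.vertex = vertex;
            this.state = state;
        }

        static Event dag(EventType type) {
            return new Event(type, null, null);
        }

        static Event vertexFinished(String vertex, VertexState state) {
            return new Event(EventType.VERTEX_FINISHED, vertex, state);
        }
    }

    /** True only if a DAG kill was logged and the DAG never finished. */
    static boolean killPending(List<Event> log) {
        boolean killRequested = false;
        boolean dagFinished = false;
        for (Event e : log) {
            if (e.type == EventType.DAG_KILL_REQUESTED) {
                killRequested = true;
            } else if (e.type == EventType.DAG_FINISHED) {
                dagFinished = true;
            }
        }
        return killRequested && !dagFinished;
    }

    /**
     * Rebuilds vertex states from the log. Remaining RUNNING vertices are
     * killed only when the log holds an unfinished DAG kill; a KILLED vertex
     * on its own is never treated as evidence of a DAG kill.
     */
    static Map<String, VertexState> recover(List<String> vertices, List<Event> log) {
        Map<String, VertexState> states = new LinkedHashMap<>();
        for (String v : vertices) {
            states.put(v, VertexState.RUNNING);
        }
        for (Event e : log) {
            if (e.type == EventType.VERTEX_FINISHED) {
                states.put(e.vertex, e.state);
            }
        }
        if (killPending(log)) {
            for (Map.Entry<String, VertexState> entry : states.entrySet()) {
                if (entry.getValue() == VertexState.RUNNING) {
                    entry.setValue(VertexState.KILLED);
                }
            }
        }
        return states;
    }

    public static void main(String[] args) {
        List<String> vertices = Arrays.asList("A", "B", "C");

        // Only a vertex-level KILLED was logged: nothing else is killed.
        List<Event> vertexKillOnly = Arrays.asList(
            Event.vertexFinished("B", VertexState.KILLED));
        System.out.println(recover(vertices, vertexKillOnly));

        // A DAG kill was logged but the AM crashed before it completed.
        List<Event> interruptedDagKill = Arrays.asList(
            Event.dag(EventType.DAG_KILL_REQUESTED),
            Event.vertexFinished("B", VertexState.KILLED));
        System.out.println(recover(vertices, interruptedDagKill));
    }
}
```

The point of the sketch is that the decision to kill the remaining vertices 
comes from an explicit logged event rather than from the recovered state of 
any single vertex. 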

Inferring a kill seems a bit confusing, as there can be multiple scenarios in 
which a vertex was killed. Consider the case I mentioned above. In a normal 
flow, all vertices apart from A will end up as KILLED with termination cause 
"other vertex failure". With a kill inferred on recovery, the vertices will 
instead have termination cause "dag kill", which is incorrect. 

If the hang issue is being resolved by the vertex impl changes, we can converge 
on a fix for that in this jira and consider the dag handling as a separate one 
unless you believe that the hang will not be completely resolved without the 
DAGImpl change. 

> AM can hang if kill received while recovering from previous attempt
> -------------------------------------------------------------------
>
>                 Key: TEZ-2311
>                 URL: https://issues.apache.org/jira/browse/TEZ-2311
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.6.0
>            Reporter: Jason Lowe
>            Assignee: Jeff Zhang
>              Labels: Recovery
>         Attachments: TEZ-2311-1.patch, TEZ-2311-2.patch
>
>
> We saw an instance of a Tez job hanging despite receiving multiple kill 
> requests from clients.  The AM was recovering from a prior attempt when the 
> first kill request arrived.



