[ 
https://issues.apache.org/jira/browse/TEZ-2311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14639341#comment-14639341
 ] 

Jeff Zhang commented on TEZ-2311:
---------------------------------

The root cause is that the vertex is killed before its tasks are scheduled, 
which means there is no recovery log for the tasks. As a result, the vertex 
never gets feedback from its tasks and stays in the RUNNING state indefinitely. 

Attached a patch to fix it.

* Recover the vertex to FAILED/KILLED directly if its recoveredState is 
FAILED/KILLED and also recover its tasks to FAILED/KILLED without recovering 
its data.
* The change in DAGImpl covers one scenario: some vertices' recoveredState is 
KILLED while others are still RUNNING. In that case, we need to kill the other 
RUNNING vertices in DAGImpl#VertexCompletedTransition.
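
To illustrate the two points above, here is a rough, simplified sketch of the 
idea. It is not the code in the patch: RecoverySketch, Vertex, Task, Dag, 
recover(), kill() and onVertexCompleted() are all hypothetical names made up 
for this example, and the real Tez state machines are considerably more 
involved.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model. None of these types are the real Tez classes.
final class RecoverySketch {
    enum State { NEW, RUNNING, SUCCEEDED, FAILED, KILLED }

    static boolean isFailedOrKilled(State s) {
        return s == State.FAILED || s == State.KILLED;
    }

    static final class Task {
        State state = State.NEW;
    }

    static final class Vertex {
        final String name;
        final List<Task> tasks = new ArrayList<>();
        State state = State.NEW;

        Vertex(String name, int numTasks) {
            this.name = name;
            for (int i = 0; i < numTasks; i++) {
                tasks.add(new Task());
            }
        }

        // Recover from the state recorded by the previous AM attempt.
        void recover(State recoveredState) {
            if (isFailedOrKilled(recoveredState)) {
                // Go straight to the terminal state, together with the tasks,
                // without recovering any data. The vertex therefore never
                // waits on tasks that have no recovery log.
                for (Task t : tasks) {
                    t.state = recoveredState;
                }
                state = recoveredState;
            } else {
                state = State.RUNNING;
            }
        }

        // Kill the vertex and its unfinished tasks if it is not yet terminal.
        void kill() {
            if (state == State.NEW || state == State.RUNNING) {
                for (Task t : tasks) {
                    if (t.state == State.NEW || t.state == State.RUNNING) {
                        t.state = State.KILLED;
                    }
                }
                state = State.KILLED;
            }
        }
    }

    static final class Dag {
        final List<Vertex> vertices = new ArrayList<>();
        State state = State.RUNNING;

        // Called when a vertex reaches a terminal state (stand-in for the
        // vertex-completed transition mentioned above).
        void onVertexCompleted(Vertex completed) {
            if (isFailedOrKilled(completed.state)) {
                // Some vertices may still be RUNNING after recovery: kill them
                // so the DAG can terminate instead of hanging.
                for (Vertex v : vertices) {
                    v.kill();
                }
                state = completed.state;
            }
        }
    }

    public static void main(String[] args) {
        Dag dag = new Dag();
        Vertex v1 = new Vertex("v1", 2);
        Vertex v2 = new Vertex("v2", 1);
        dag.vertices.add(v1);
        dag.vertices.add(v2);

        v2.recover(State.RUNNING); // v2 was still running in the last attempt
        v1.recover(State.KILLED);  // v1 was killed before its tasks were scheduled
        dag.onVertexCompleted(v1);

        System.out.println(v1.name + "=" + v1.state + ", " + v2.name + "="
                + v2.state + ", dag=" + dag.state);
    }
}
```

In this sketch the recovered KILLED vertex never depends on feedback from its 
tasks, and the DAG kills the vertex that was still RUNNING rather than waiting 
on it.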

[~hitesh] Please help review it. 

> AM can hang if kill received while recovering from previous attempt
> -------------------------------------------------------------------
>
>                 Key: TEZ-2311
>                 URL: https://issues.apache.org/jira/browse/TEZ-2311
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.6.0
>            Reporter: Jason Lowe
>            Assignee: Jeff Zhang
>              Labels: Recovery
>         Attachments: TEZ-2311-1.patch
>
>
> We saw an instance of a Tez job hanging despite receiving multiple kill 
> requests from clients.  The AM was recovering from a prior attempt when the 
> first kill request arrived.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
