[
https://issues.apache.org/jira/browse/TEZ-2311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14639341#comment-14639341
]
Jeff Zhang commented on TEZ-2311:
---------------------------------
The root cause is that the vertex is killed before its tasks are scheduled, which
means there is no recovery log for the tasks. As a result, the vertex never gets
feedback from its tasks and stays in the RUNNING state indefinitely.
I have attached a patch to fix it.
* Recover the vertex to FAILED/KILLED directly if its recoveredState is
FAILED/KILLED, and also recover its tasks to FAILED/KILLED without recovering
its data.
* The change to DAGImpl covers one scenario: some vertices' recoveredState is
KILLED while others are still RUNNING. In that case, we need to kill the other
RUNNING vertices in DAGImpl#VertexCompletedTransition.
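
To illustrate the two changes above, here is a rough, simplified model of the intended behavior. It is only a sketch and not the code in the patch: RecoverySketch, Vertex, Task, Dag, terminate() and onVertexCompleted() are hypothetical stand-ins, not the real VertexImpl/DAGImpl classes or their transitions.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model; none of these types are real Tez classes.
class RecoverySketch {
    enum State { NEW, RUNNING, FAILED, KILLED }

    static class Task {
        State state = State.NEW;
    }

    static class Vertex {
        final String name;
        final State recoveredState; // state read back from the previous attempt
        final List<Task> tasks = new ArrayList<>();
        State state = State.NEW;

        Vertex(String name, State recoveredState, int numTasks) {
            this.name = name;
            this.recoveredState = recoveredState;
            for (int i = 0; i < numTasks; i++) {
                tasks.add(new Task());
            }
        }

        // First bullet: a vertex recovered as FAILED/KILLED goes straight to that
        // state and takes its tasks with it, instead of waiting for task feedback
        // that will never arrive because the tasks were never scheduled.
        void recover() {
            if (recoveredState == State.FAILED || recoveredState == State.KILLED) {
                terminate(recoveredState);
            } else {
                state = State.RUNNING;
            }
        }

        void terminate(State terminal) {
            for (Task t : tasks) {
                t.state = terminal;
            }
            state = terminal;
        }
    }

    static class Dag {
        final List<Vertex> vertices = new ArrayList<>();

        void recover() {
            for (Vertex v : vertices) {
                v.recover();
            }
            for (Vertex v : vertices) {
                if (v.state == State.KILLED) {
                    onVertexCompleted(v);
                }
            }
        }

        // Second bullet: when a vertex completes as KILLED, kill the vertices
        // that are still RUNNING so the DAG can reach a terminal state.
        void onVertexCompleted(Vertex completed) {
            if (completed.state != State.KILLED) {
                return;
            }
            for (Vertex v : vertices) {
                if (v.state == State.RUNNING) {
                    v.terminate(State.KILLED);
                }
            }
        }
    }

    public static void main(String[] args) {
        Dag dag = new Dag();
        dag.vertices.add(new Vertex("v1", State.KILLED, 2));
        dag.vertices.add(new Vertex("v2", State.RUNNING, 3));
        dag.recover();
        for (Vertex v : dag.vertices) {
            System.out.println(v.name + " -> " + v.state);
        }
    }
}
```

In this model no vertex is left in RUNNING once any vertex has been recovered as KILLED, which is the condition that previously caused the hang.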
[~hitesh] Please help review it.
> AM can hang if kill received while recovering from previous attempt
> -------------------------------------------------------------------
>
> Key: TEZ-2311
> URL: https://issues.apache.org/jira/browse/TEZ-2311
> Project: Apache Tez
> Issue Type: Bug
> Affects Versions: 0.6.0
> Reporter: Jason Lowe
> Assignee: Jeff Zhang
> Labels: Recovery
> Attachments: TEZ-2311-1.patch
>
>
> We saw an instance of a Tez job hanging despite receiving multiple kill
> requests from clients. The AM was recovering from a prior attempt when the
> first kill request arrived.