[
https://issues.apache.org/jira/browse/TEZ-2311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14639538#comment-14639538
]
Jeff Zhang commented on TEZ-2311:
---------------------------------
I had considered adding a recovery log entry for the DAG kill operation, but
thought it might be a little heavy here. On second thought, it does not seem
difficult (I will post another patch). The change in VertexImpl doesn't resolve
the hang issue completely. Consider the case where all the other vertices are
recovered to KILLED, while one vertex is recovered to RUNNING and a new task
attempt is scheduled. That new task attempt may wait indefinitely for data
movement events from its upstream vertices. Or, if no task attempt is
scheduled, the vertex's VertexManager may wait indefinitely for something from
upstream.
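
To make the scenario concrete, here is a minimal sketch in plain Java. It is
not Tez code: the class RecoveryHangSketch, the State enum, and the method
mayWaitForever are hypothetical names used only for illustration, and the
recovered states and upstream edges are modelled as simple maps. It flags a
vertex that was recovered to RUNNING while at least one of its upstream
vertices was recovered to KILLED, which is the situation where the vertex may
wait for events that never arrive.

```java
import java.util.List;
import java.util.Map;

// Hypothetical model for illustration only; none of these names come from Tez.
final class RecoveryHangSketch {

    // Simplified stand-in for the states a vertex may be recovered to.
    enum State { KILLED, RUNNING, SUCCEEDED }

    // Returns true if `vertex` was recovered to RUNNING and at least one of
    // its direct upstream vertices was recovered to KILLED. A killed upstream
    // vertex produces no further data movement events, so the running vertex
    // (or its vertex manager) may wait indefinitely.
    static boolean mayWaitForever(String vertex,
                                  Map<String, State> recovered,
                                  Map<String, List<String>> upstream) {
        if (recovered.get(vertex) != State.RUNNING) {
            return false;
        }
        for (String up : upstream.getOrDefault(vertex, List.of())) {
            if (recovered.get(up) == State.KILLED) {
                return true;
            }
        }
        return false;
    }
}
```

Such a check only detects the inconsistent state. Recording the DAG kill in the
recovery log, as proposed above, would instead let recovery terminate every
vertex consistently.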
> AM can hang if kill received while recovering from previous attempt
> -------------------------------------------------------------------
>
> Key: TEZ-2311
> URL: https://issues.apache.org/jira/browse/TEZ-2311
> Project: Apache Tez
> Issue Type: Bug
> Affects Versions: 0.6.0
> Reporter: Jason Lowe
> Assignee: Jeff Zhang
> Labels: Recovery
> Attachments: TEZ-2311-1.patch, TEZ-2311-2.patch
>
>
> We saw an instance of a Tez job hanging despite receiving multiple kill
> requests from clients. The AM was recovering from a prior attempt when the
> first kill request arrived.