[
https://issues.apache.org/jira/browse/TEZ-2311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14639496#comment-14639496
]
Hitesh Shah commented on TEZ-2311:
----------------------------------
bq. Change on DAGImpl is due to one scenario (some vertices' recoveredState is
KILLED, while others are still in RUNNING). In that case, we need to kill the
other RUNNING vertices in DAGImpl#VertexCompletedTransition.
I think the fix here should probably go the other way around. One vertex being
in the KILLED state on recovery should not make the DAG start killing
everything else. The better fix seems to be to write the DAG kill event to the
recovery log when it is received; if the DAG kill does not finish before the AM
crashes, then on recovery, process the recovery log and complete the kill as
needed.
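A rough sketch of that recovery-log approach, to make the idea concrete. Every
type and name below is hypothetical and is not a Tez class; it only assumes
that the recovery log can be read back as an ordered list of events, one of
which records that a DAG kill was requested.

```java
import java.util.List;

/** Illustrative only: these types are hypothetical and are not Tez classes. */
final class DagKillRecoverySketch {

    /** Hypothetical kinds of entries in a recovery log. */
    enum EventType { DAG_STARTED, VERTEX_KILLED, DAG_KILL_REQUESTED, DAG_FINISHED }

    /** What the new AM attempt should do after replaying the log. */
    enum RecoveryAction { RESUME_DAG, COMPLETE_DAG_KILL, NOTHING_TO_DO }

    /** Replays the log and decides based on the recorded kill request only. */
    static RecoveryAction decide(List<EventType> recoveryLog) {
        boolean killRequested = false;
        for (EventType event : recoveryLog) {
            if (event == EventType.DAG_FINISHED) {
                // The DAG reached a terminal state before the AM went down.
                return RecoveryAction.NOTHING_TO_DO;
            }
            if (event == EventType.DAG_KILL_REQUESTED) {
                killRequested = true;
            }
            // VERTEX_KILLED is deliberately ignored: one killed vertex does
            // not, by itself, mean the whole DAG was asked to die.
        }
        return killRequested ? RecoveryAction.COMPLETE_DAG_KILL
                             : RecoveryAction.RESUME_DAG;
    }
}
```

With this shape, a log holding only a killed vertex resumes the DAG, while a
log holding an unfinished kill request completes the kill on recovery.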
Inferring a kill seems a bit confusing, as there can be multiple scenarios in
which a vertex was killed. Consider the case I mentioned above. In a normal
flow, all vertices apart from A will end up as KILLED with termination cause
"other vertex failure". On recovery, the vertices will instead have termination
cause "dag kill", which is incorrect.
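To illustrate the mismatch, here is a simplified sketch. It assumes the case
above is one where vertex A failed and the other vertices were being killed as
a result; the class, enums, and the `dagKillLogged` flag are all hypothetical
and not taken from Tez or from the patch.

```java
import java.util.Collection;

/** Illustrative only: these types are hypothetical and are not Tez classes. */
final class TerminationCauseSketch {

    enum VertexState { RUNNING, SUCCEEDED, FAILED, KILLED }

    enum Cause { DAG_KILL, OTHER_VERTEX_FAILURE, NONE }

    /** Inference from recovered states alone: any KILLED vertex reads as a DAG kill. */
    static Cause inferFromStates(Collection<VertexState> recovered) {
        return recovered.contains(VertexState.KILLED) ? Cause.DAG_KILL : Cause.NONE;
    }

    /** Uses what was recorded: an explicit kill request first, then a failed vertex. */
    static Cause fromRecordedEvents(Collection<VertexState> recovered,
                                    boolean dagKillLogged) {
        if (dagKillLogged) {
            return Cause.DAG_KILL;
        }
        if (recovered.contains(VertexState.FAILED)) {
            return Cause.OTHER_VERTEX_FAILURE;
        }
        return Cause.NONE;
    }
}
```

For recovered states FAILED, KILLED, RUNNING with no logged DAG kill, the
inference reports a DAG kill, while the recorded-events version reports
"other vertex failure", which matches the normal flow.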
If the hang is resolved by the vertex impl changes, we can converge on that fix
in this jira and treat the DAG handling as a separate jira, unless you believe
that the hang will not be completely resolved without the DAGImpl change.
> AM can hang if kill received while recovering from previous attempt
> -------------------------------------------------------------------
>
> Key: TEZ-2311
> URL: https://issues.apache.org/jira/browse/TEZ-2311
> Project: Apache Tez
> Issue Type: Bug
> Affects Versions: 0.6.0
> Reporter: Jason Lowe
> Assignee: Jeff Zhang
> Labels: Recovery
> Attachments: TEZ-2311-1.patch, TEZ-2311-2.patch
>
>
> We saw an instance of a Tez job hanging despite receiving multiple kill
> requests from clients. The AM was recovering from a prior attempt when the
> first kill request arrived.