[
https://issues.apache.org/jira/browse/TEZ-2311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14643871#comment-14643871
]
Jeff Zhang commented on TEZ-2311:
---------------------------------
Uploaded another patch.
There are 2 cases that cause the recovery hang:
* A vertex is killed or fails before its tasks are started (the vertex then
waits indefinitely for its tasks' status).
** Solution: Recover its tasks to the desired state without recovering their data.
* The DAG is killed while running and the AM shuts down before it gets the
status of all its vertices.
** Solution: Create a new event, DAGKillRequestEvent, as a critical event, and
re-send the kill event during recovery.
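
To illustrate the second case, here is a minimal sketch of the idea, not the code in the patch. It assumes the kill request is persisted as its own journal entry, so that a recovering attempt can notice it and re-send the kill instead of waiting for vertex status. The names RecoverySketch, EventType, DagState, and recover are hypothetical and are not Tez classes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch only: none of these types exist in Tez.
final class RecoverySketch {

    // Simplified stand-ins for entries in the recovery journal.
    enum EventType { DAG_SUBMITTED, VERTEX_STARTED, DAG_KILL_REQUEST, DAG_FINISHED }

    // Simplified stand-ins for the state the recovering attempt settles on.
    enum DagState { RUNNING, KILLED, FINISHED }

    /**
     * Replays the journal written by the previous attempt.
     *
     * @param journal    events the previous attempt persisted before the AM shut down
     * @param dispatched collects the events this attempt would re-send (assumed dispatcher)
     * @return the state the DAG should be recovered to
     */
    static DagState recover(List<EventType> journal, List<String> dispatched) {
        boolean killRequested = false;
        boolean finished = false;
        for (EventType e : journal) {
            if (e == EventType.DAG_KILL_REQUEST) {
                killRequested = true;
            } else if (e == EventType.DAG_FINISHED) {
                finished = true;
            }
        }
        if (finished) {
            // The previous attempt already reached a terminal state; nothing to re-send.
            return DagState.FINISHED;
        }
        if (killRequested) {
            // The kill was recorded but the AM went down before all vertices reported.
            // Re-send the kill rather than waiting for status that will never arrive.
            dispatched.add("DAG_KILL");
            return DagState.KILLED;
        }
        // No kill was recorded; continue normal recovery.
        return DagState.RUNNING;
    }
}
```

In this sketch, a journal containing DAG_SUBMITTED followed by DAG_KILL_REQUEST recovers to KILLED and dispatches one kill event, while a journal without the kill entry recovers to RUNNING.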
[~hitesh], please help review.
> AM can hang if kill received while recovering from previous attempt
> -------------------------------------------------------------------
>
> Key: TEZ-2311
> URL: https://issues.apache.org/jira/browse/TEZ-2311
> Project: Apache Tez
> Issue Type: Bug
> Affects Versions: 0.6.0
> Reporter: Jason Lowe
> Assignee: Jeff Zhang
> Labels: Recovery
> Attachments: TEZ-2311-1.patch, TEZ-2311-2.patch, TEZ-2311-3.patch
>
>
> We saw an instance of a Tez job hanging despite receiving multiple kill
> requests from clients. The AM was recovering from a prior attempt when the
> first kill request arrived.