[
https://issues.apache.org/jira/browse/TEZ-2311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14645410#comment-14645410
]
Jeff Zhang commented on TEZ-2311:
---------------------------------
Attached a new patch.
* Add isSessionStopped to DAGKillRequestEvent, and shut down the AM without
recovering the DAG when isSessionStopped is true.
** Although I added isSessionStopped to DAGKillRequestEvent, it doesn't solve
the problem completely. The AM can also crash before the RM unregister happens
in cases where no DAG needs to be killed (currentDAG is null or currentDAG has
completed). So we may need a new event, such as AMShutdownRequestEvent, to
track that the AM has been requested to shut down. To recap, recovery in Tez
doesn't only mean recovering the DAG; it also includes recovering the AM.
There are several side effects/requests from outside on the AM:
*** submit DAG (DAGSubmittedEvent covers it)
*** kill DAG (DAGKillRequestEvent covers it)
*** kill AM (no event tracks this; I will create another JIRA for it)
* Recover the DAG to the desired state KILLED when the DAG has received a kill
request.
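To make the intended recovery behavior concrete, here is a minimal sketch of the decision described in the list above. It is not the patch itself: RecoverySketch, HistoryEvent, DagSubmitted, DagKillRequest, Action and decide are hypothetical stand-ins rather than the real Tez classes, and the sketch assumes the recovery log holds events for a single submitted DAG.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch only; these types are NOT the real Tez classes.
final class RecoverySketch {

    // What the new AM attempt should do after replaying the recovery log.
    enum Action { RESUME_DAG, RECOVER_AS_KILLED, SHUTDOWN_WITHOUT_RECOVERY }

    // Stand-in for an entry in the recovery log.
    interface HistoryEvent {}

    // Stand-in for DAGSubmittedEvent.
    static final class DagSubmitted implements HistoryEvent {}

    // Stand-in for DAGKillRequestEvent carrying the new flag.
    static final class DagKillRequest implements HistoryEvent {
        final boolean isSessionStopped;

        DagKillRequest(boolean isSessionStopped) {
            this.isSessionStopped = isSessionStopped;
        }
    }

    // Assumes the log describes one submitted DAG from the previous attempt.
    static Action decide(List<HistoryEvent> log) {
        Action action = Action.RESUME_DAG;
        for (HistoryEvent event : log) {
            if (event instanceof DagKillRequest) {
                if (((DagKillRequest) event).isSessionStopped) {
                    // The session was stopped: skip DAG recovery, shut down.
                    return Action.SHUTDOWN_WITHOUT_RECOVERY;
                }
                // A kill was requested: recover straight to KILLED.
                action = Action.RECOVER_AS_KILLED;
            }
        }
        return action;
    }

    public static void main(String[] args) {
        System.out.println(decide(Arrays.asList(new DagSubmitted())));
        System.out.println(decide(Arrays.asList(
                new DagSubmitted(), new DagKillRequest(false))));
        System.out.println(decide(Arrays.asList(
                new DagSubmitted(), new DagKillRequest(true))));
    }
}
```

The sketch deliberately leaves out the "kill AM" case, since no event tracks it yet; a shutdown request that arrives when no DAG needs to be killed would not show up in this log at all.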
[~hitesh] Please help review.
> AM can hang if kill received while recovering from previous attempt
> -------------------------------------------------------------------
>
> Key: TEZ-2311
> URL: https://issues.apache.org/jira/browse/TEZ-2311
> Project: Apache Tez
> Issue Type: Bug
> Affects Versions: 0.6.0
> Reporter: Jason Lowe
> Assignee: Jeff Zhang
> Labels: Recovery
> Attachments: TEZ-2311-1.patch, TEZ-2311-2.patch, TEZ-2311-3.patch,
> TEZ-2311-4.patch
>
>
> We saw an instance of a Tez job hanging despite receiving multiple kill
> requests from clients. The AM was recovering from a prior attempt when the
> first kill request arrived.