[
https://issues.apache.org/jira/browse/TEZ-2311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14647013#comment-14647013
]
Hitesh Shah commented on TEZ-2311:
----------------------------------
bq. DAGKillRequestEvent means the kill action is triggered, but doesn't mean
the kill is completed. It is possible we only see DAGKillRequestEvent but no
DAGFinishedEvent.
Correct, but that may be fine, I think, since we can treat the DAG as KILLED.
What kind of problems do you see if we treat a DAGKillRequestEvent summary
event the same as a DAGFinished SummaryEvent with state KILLED?
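To make the question concrete, here is a minimal sketch of the recovery rule being proposed: a kill request with no matching finish event is recovered as KILLED. All class, enum, and method names below are hypothetical and chosen for illustration only; they are not Tez's actual recovery classes or the code in the attached patches.

```java
import java.util.Arrays;
import java.util.List;

public class RecoverySummarySketch {
    // Hypothetical summary event kinds (illustrative, not Tez's real types).
    enum SummaryEventType { DAG_SUBMITTED, DAG_KILL_REQUEST, DAG_FINISHED }

    // Hypothetical states a DAG can be recovered into.
    enum RecoveredState { RUNNING, KILLED, SUCCEEDED, FAILED }

    static final class SummaryEvent {
        final SummaryEventType type;
        final RecoveredState finalState; // only meaningful for DAG_FINISHED

        SummaryEvent(SummaryEventType type, RecoveredState finalState) {
            this.type = type;
            this.finalState = finalState;
        }
    }

    // Assumed rule: an explicit finish event wins; otherwise a kill request
    // alone is treated as if the DAG had finished with state KILLED.
    static RecoveredState recoverState(List<SummaryEvent> events) {
        boolean killRequested = false;
        for (SummaryEvent e : events) {
            switch (e.type) {
                case DAG_FINISHED:
                    return e.finalState;
                case DAG_KILL_REQUEST:
                    killRequested = true;
                    break;
                default:
                    break;
            }
        }
        return killRequested ? RecoveredState.KILLED : RecoveredState.RUNNING;
    }

    public static void main(String[] args) {
        // Previous attempt logged the kill request but died before finishing.
        List<SummaryEvent> killOnly = Arrays.asList(
            new SummaryEvent(SummaryEventType.DAG_SUBMITTED, null),
            new SummaryEvent(SummaryEventType.DAG_KILL_REQUEST, null));
        System.out.println(recoverState(killOnly));
    }
}
```

Under this rule the new attempt would not resume a DAG the user already asked to kill, so the open question is only whether some case needs the distinction between "kill requested" and "kill completed".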
bq. Even if this code block is put after the DAG recovery code, the AM may
still shut down before recovery completes, because the whole recovery is
handled in the dispatcher thread. But putting the shutdown block after the
recovery code would be better, since it gives us more chance to recover the
DAG before the AM shuts down.
I think either approach is fine. The question is which one is better for the
state the end user sees, i.e. YARN app diagnostics and data in timeline?
> AM can hang if kill received while recovering from previous attempt
> -------------------------------------------------------------------
>
> Key: TEZ-2311
> URL: https://issues.apache.org/jira/browse/TEZ-2311
> Project: Apache Tez
> Issue Type: Bug
> Affects Versions: 0.6.0
> Reporter: Jason Lowe
> Assignee: Jeff Zhang
> Labels: Recovery
> Attachments: TEZ-2311-1.patch, TEZ-2311-2.patch, TEZ-2311-3.patch,
> TEZ-2311-4.patch, TEZ-2311-5.patch
>
>
> We saw an instance of a Tez job hanging despite receiving multiple kill
> requests from clients. The AM was recovering from a prior attempt when the
> first kill request arrived.