[
https://issues.apache.org/jira/browse/TEZ-2311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14647001#comment-14647001
]
Jeff Zhang commented on TEZ-2311:
---------------------------------
bq. shouldn't this code block be after the dag is recovered to killed state?
The current approach will leave the dag in an unfinished state?
Even if this code block is placed after the dag recovery code, the AM may still
shut down before recovery completes, because the whole recovery is handled in
the dispatcher thread. But putting the shutdown block after the recovery code
would be better, since it gives us more chance to recover the dag before the AM
shuts down.
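To illustrate the ordering issue, here is a minimal sketch. It is not Tez code: the class name, the single-thread executor standing in for the dispatcher thread, and the latch are all assumptions made for illustration. The latch forces the interleaving so the outcome is reproducible; in a real AM the order would depend on timing.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical stand-in, not a Tez class.
class DispatcherOrderingSketch {
    static List<String> run() {
        final List<String> log = Collections.synchronizedList(new ArrayList<String>());
        final CountDownLatch shutdownRequested = new CountDownLatch(1);
        // A single-thread executor plays the role of the dispatcher thread.
        ExecutorService dispatcher = Executors.newSingleThreadExecutor();
        try {
            // Recovery is only enqueued here; the caller does not wait for it.
            Future<?> recovery = dispatcher.submit(() -> {
                // Forced delay: recovery is still in progress when the
                // caller reaches its shutdown block.
                shutdownRequested.await();
                log.add("recovery complete");
                return null;
            });
            // The shutdown block sits after the recovery call in program
            // order, yet it runs before recovery has finished.
            log.add("shutdown requested");
            shutdownRequested.countDown();
            recovery.get(); // wait only so the sketch can report the order
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            dispatcher.shutdown();
        }
        return log;
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

The point of the sketch is only that placing the shutdown block later in program order does not by itself make it wait for work that was handed off to another thread.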
bq. I would assume that the "killRequested" path should never be invoked -
instead it should just be invoking dag recover in a completed state with
state = KILLED? i.e. recoveredDAGData.isCompleted should become true and state
KILLED if a dagKillRequested event is seen ( similar to seeing a
DAGFinishedEvent in summary only with state = killed ). Or am I missing
something?
DAGKillRequestEvent means the kill action has been triggered, but it doesn't
mean the kill has completed. It is possible that we only see
DAGKillRequestEvent but no DAGFinishedEvent.
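A rough sketch of the three cases recovery has to tell apart follows. The enum values, the class, and the `decide` method are hypothetical names for illustration and are not the actual Tez recovery classes or the logic in the patch.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical simplification, not Tez's recovery implementation.
class RecoveryDecisionSketch {
    // Simplified event kinds that may appear in the recovery log.
    enum LoggedEvent { DAG_SUBMITTED, DAG_KILL_REQUESTED, DAG_FINISHED }

    // What the new AM attempt would do with the recovered dag.
    enum RecoveryAction { USE_FINAL_STATE, COMPLETE_KILL, RESUME_DAG }

    static RecoveryAction decide(List<LoggedEvent> events) {
        if (events.contains(LoggedEvent.DAG_FINISHED)) {
            // The dag reached a terminal state in the previous attempt.
            return RecoveryAction.USE_FINAL_STATE;
        }
        if (events.contains(LoggedEvent.DAG_KILL_REQUESTED)) {
            // Kill was triggered but never completed: the dag is not yet
            // in a completed state, so the kill still has to be carried out.
            return RecoveryAction.COMPLETE_KILL;
        }
        // No kill and no finish recorded: continue normal recovery.
        return RecoveryAction.RESUME_DAG;
    }

    public static void main(String[] args) {
        System.out.println(decide(Arrays.asList(
                LoggedEvent.DAG_SUBMITTED, LoggedEvent.DAG_KILL_REQUESTED)));
    }
}
```

The middle branch is the case described above: a kill request was logged, but no finish event followed it.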
> AM can hang if kill received while recovering from previous attempt
> -------------------------------------------------------------------
>
> Key: TEZ-2311
> URL: https://issues.apache.org/jira/browse/TEZ-2311
> Project: Apache Tez
> Issue Type: Bug
> Affects Versions: 0.6.0
> Reporter: Jason Lowe
> Assignee: Jeff Zhang
> Labels: Recovery
> Attachments: TEZ-2311-1.patch, TEZ-2311-2.patch, TEZ-2311-3.patch,
> TEZ-2311-4.patch, TEZ-2311-5.patch
>
>
> We saw an instance of a Tez job hanging despite receiving multiple kill
> requests from clients. The AM was recovering from a prior attempt when the
> first kill request arrived.