[ https://issues.apache.org/jira/browse/TEZ-2311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14646988#comment-14646988 ]
Hitesh Shah commented on TEZ-2311:
----------------------------------
Comments:
Minor grammar typo:
s/LOG.info("AM is crashed when shutting down in the previous AM/LOG.info("AM crashed when shutting down in the previous AM/
{code}
if (recoveredDAGData.isSessionStopped) {
  // it should be fine that don't recover the dag when AM is crashed when shutting down
  LOG.info("AM is crashed when shutting down in the previous AM attempt"
      + ", continue the shutdown and recover it to SUCCEEDED");
  this.state = DAGAppMasterState.SUCCEEDED;
  this.taskSchedulerEventHandler.setShouldUnregisterFlag();
  shutdownHandler.shutdown();
  return;
}
{code}
- Shouldn't this code block come after the DAG is recovered to a KILLED state?
The current approach will leave the DAG in an unfinished state?
I would assume that the "killRequested" path should never be invoked. Instead,
it should just invoke DAG recovery in a completed state with state = KILLED,
i.e. recoveredDAGData.isCompleted should become true and the state KILLED if a
dagKillRequested event is seen (similar to seeing a DAGFinishedEvent in the
summary, only with state = KILLED). Or am I missing something?
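To make the suggestion above concrete, here is a minimal, self-contained sketch of the summary scan I have in mind. All names in it (RecoverySketch, SummaryEvent, DagState, RecoveredDagData, scanSummary) are hypothetical stand-ins, not the actual Tez recovery classes, and the rule that a recorded finish event takes precedence over a kill request is my assumption rather than something the patch does.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical model of the recovery summary scan; NOT the real Tez classes.
final class RecoverySketch {

    // Simplified stand-ins for the summary events seen during recovery.
    enum SummaryEvent { DAG_SUBMITTED, DAG_KILL_REQUESTED, DAG_FINISHED_SUCCEEDED, DAG_FINISHED_KILLED }

    enum DagState { RUNNING, SUCCEEDED, KILLED }

    // Simplified stand-in for the recovered DAG data.
    static final class RecoveredDagData {
        boolean isCompleted = false;
        DagState state = DagState.RUNNING;
    }

    // Scans the summary events in order. A kill request is treated like a
    // finished event with state KILLED, so recovery never needs a separate
    // "killRequested" path. Assumption: a recorded finish event wins over
    // a kill request, whichever order they appear in.
    static RecoveredDagData scanSummary(List<SummaryEvent> events) {
        RecoveredDagData data = new RecoveredDagData();
        boolean finishSeen = false;
        for (SummaryEvent event : events) {
            switch (event) {
                case DAG_FINISHED_SUCCEEDED:
                    data.isCompleted = true;
                    data.state = DagState.SUCCEEDED;
                    finishSeen = true;
                    break;
                case DAG_FINISHED_KILLED:
                    data.isCompleted = true;
                    data.state = DagState.KILLED;
                    finishSeen = true;
                    break;
                case DAG_KILL_REQUESTED:
                    if (!finishSeen) {
                        data.isCompleted = true;
                        data.state = DagState.KILLED;
                    }
                    break;
                default:
                    // Other events do not affect completion in this sketch.
                    break;
            }
        }
        return data;
    }

    public static void main(String[] args) {
        RecoveredDagData data = scanSummary(
            Arrays.asList(SummaryEvent.DAG_SUBMITTED, SummaryEvent.DAG_KILL_REQUESTED));
        System.out.println(data.isCompleted + " " + data.state);
    }
}
```

With this shape, a DAG whose summary contains a kill request is recovered as completed and KILLED, so the AM can proceed with the normal completed-DAG handling before any session shutdown logic runs.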
> AM can hang if kill received while recovering from previous attempt
> -------------------------------------------------------------------
>
> Key: TEZ-2311
> URL: https://issues.apache.org/jira/browse/TEZ-2311
> Project: Apache Tez
> Issue Type: Bug
> Affects Versions: 0.6.0
> Reporter: Jason Lowe
> Assignee: Jeff Zhang
> Labels: Recovery
> Attachments: TEZ-2311-1.patch, TEZ-2311-2.patch, TEZ-2311-3.patch,
> TEZ-2311-4.patch, TEZ-2311-5.patch
>
>
> We saw an instance of a Tez job hanging despite receiving multiple kill
> requests from clients. The AM was recovering from a prior attempt when the
> first kill request arrived.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)