[ 
https://issues.apache.org/jira/browse/TEZ-2311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14647209#comment-14647209
 ] 

Hitesh Shah commented on TEZ-2311:
----------------------------------

Mostly looks good. 

One comment though: 

{code}
if (recoveredDAGData.isSessionStopped) {
  LOG.info("AM crashed when shutting down in the previous attempt"
      + ", continue the shutdown and recover it to SUCCEEDED");
  this.state = DAGAppMasterState.SUCCEEDED;
  this.taskSchedulerEventHandler.setShouldUnregisterFlag();
  shutdownHandler.shutdown();
  return;
}
{code}

Should the above be done only in session mode? In non-session mode, the final 
state should be set to KILLED. Also, I don't see "sessionStopped" set to false 
anywhere. 

One option for non-session mode is to let the DAG recover as needed, so that 
the AM shuts down once the DAG finished event comes back. Alternatively, just 
invoke shutdown before that happens. 
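
To make the suggestion concrete, here is a rough, self-contained sketch of the branching I have in mind. It is not taken from the patch: the {{AppState}} enum, the {{isSession}} parameter and the {{resolveState}} helper are all hypothetical stand-ins, and the real DAGAppMaster would also have to unregister and invoke the shutdown handler.

```java
// Hypothetical sketch only: none of these names come from the TEZ-2311 patch.
final class RecoveryShutdownSketch {

    // Stand-in for the AM state; the real enum has more values.
    enum AppState { RECOVERING, SUCCEEDED, KILLED }

    /**
     * Picks the state to move to after reading the recovery data.
     *
     * @param sessionStopped whether the previous attempt crashed while shutting down
     * @param isSession      whether the AM runs in session mode (assumed flag)
     */
    static AppState resolveState(boolean sessionStopped, boolean isSession) {
        if (!sessionStopped) {
            // Nothing special recorded: continue with normal recovery.
            return AppState.RECOVERING;
        }
        // Session mode: finish the interrupted shutdown as SUCCEEDED.
        // Non-session mode: report KILLED instead.
        return isSession ? AppState.SUCCEEDED : AppState.KILLED;
    }

    public static void main(String[] args) {
        System.out.println(resolveState(true, true));
        System.out.println(resolveState(true, false));
        System.out.println(resolveState(false, false));
    }
}
```

This covers only the "invoke shutdown right away" variant. Letting the DAG recover first and shutting down on the DAG finished event would need the event handling, which is left out here.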
 

> AM can hang if kill received while recovering from previous attempt
> -------------------------------------------------------------------
>
>                 Key: TEZ-2311
>                 URL: https://issues.apache.org/jira/browse/TEZ-2311
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.6.0
>            Reporter: Jason Lowe
>            Assignee: Jeff Zhang
>              Labels: Recovery
>         Attachments: TEZ-2311-1.patch, TEZ-2311-2.patch, TEZ-2311-3.patch, 
> TEZ-2311-4.patch, TEZ-2311-5.patch, TEZ-2311-6.patch
>
>
> We saw an instance of a Tez job hanging despite receiving multiple kill 
> requests from clients.  The AM was recovering from a prior attempt when the 
> first kill request arrived.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
