[jira] [Commented] (TEZ-2311) AM can hang if kill received while recovering from previous attempt

Hitesh Shah (JIRA) Wed, 29 Jul 2015 17:46:26 -0700

    [ https://issues.apache.org/jira/browse/TEZ-2311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14646988#comment-14646988 ]


Hitesh Shah commented on TEZ-2311:
----------------------------------

Comments: 

Minor grammar typo: s/LOG.info("AM is crashed when shutting down in the previous AM/LOG.info("AM crashed when shutting down in the previous AM/

{code}
if (recoveredDAGData.isSessionStopped) {
  // it should be fine that don't recover the dag when AM is crashed when shutting down
  LOG.info("AM is crashed when shutting down in the previous AM attempt"
      + ", continue the shutdown and recover it to SUCCEEDED");
  this.state = DAGAppMasterState.SUCCEEDED;
  this.taskSchedulerEventHandler.setShouldUnregisterFlag();
  shutdownHandler.shutdown();
  return;
}
{code}
   - Shouldn't this code block come after the DAG is recovered to the KILLED 
state? Won't the current approach leave the DAG in an unfinished state? 

I would assume that the "killRequested" path should never be invoked. Instead, 
recovery should just recover the DAG in a completed state with state = KILLED, 
i.e. recoveredDAGData.isCompleted should become true with state KILLED if a 
dagKillRequested event is seen (similar to seeing a DAGFinishedEvent in the 
summary, only with state = KILLED). Or am I missing something? 





 

> AM can hang if kill received while recovering from previous attempt
> -------------------------------------------------------------------
>
>                 Key: TEZ-2311
>                 URL: https://issues.apache.org/jira/browse/TEZ-2311
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.6.0
>            Reporter: Jason Lowe
>            Assignee: Jeff Zhang
>              Labels: Recovery
>         Attachments: TEZ-2311-1.patch, TEZ-2311-2.patch, TEZ-2311-3.patch, 
> TEZ-2311-4.patch, TEZ-2311-5.patch
>
>
> We saw an instance of a Tez job hanging despite receiving multiple kill 
> requests from clients.  The AM was recovering from a prior attempt when the 
> first kill request arrived.



