[
https://issues.apache.org/jira/browse/MAPREDUCE-3034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13204001#comment-13204001
]
Eric Payne commented on MAPREDUCE-3034:
---------------------------------------
@Arun,
I'm pretty sure that the NodeStatusUpdaterImpl.stop() hierarchy already stops
the AppMaster and Containers on the NM via the AsyncDispatcher event process. I
verified this by examining the code, running tests, and examining the logs.
# Verified by examining the code:
** When the reboot command comes from the RM to the NM,
NodeStatusUpdaterImpl.reboot() sets the isRebooted flag and calls
NodeStatusUpdaterImpl.stop().
** NodeStatusUpdaterImpl.stop() (eventually) calls both
AbstractService.changeState() and CompositeService.stop(int
numOfServicesStarted). These methods loop through the list of services
registered with them and stop each one.
# Verified by running tests:
** With this change implemented, I started a long-running mapred job and then
stopped and restarted the RM.
** During the interval between stopping and restarting the RM, I took a
snapshot of the running Java processes.
** Also, during the interval between stopping and restarting the RM, I searched
the NM and container logs for messages from the AsyncDispatcher to determine if
any services were stopped. None were.
** After restarting the RM, I took another snapshot of the Java processes.
Comparing the two snapshots showed that before the RM came back up, the
long-running mapred job was still alive, with both the MRAppMaster process and
the container's YarnChild process running. After the RM started again, the
MRAppMaster and YarnChild processes were gone.
# Verified by examining logs:
** After running the above test, I searched the NM and container logs again
and found several services that had been stopped via the AsyncDispatcher event
process. Of particular interest, the ones from the container {{syslog}} file
were these:
*** JobHistoryEventHandler
*** ContainerLauncherImpl
*** MRAppMaster$ContainerLauncherRouter
*** RMCommunicator
*** MRAppMaster$ContainerAllocatorRouter
*** MRClientService
*** TaskCleaner
*** TaskHeartbeatHandler
*** TaskAttemptListenerImpl
*** Dispatcher
*** MRAppMaster
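
The stop cascade described in the code-examination steps above can be sketched as follows. This is a simplified, self-contained model, not the actual Hadoop classes: {{NodeStatusUpdaterSketch}}, {{Service}}, and the child list here only mimic the pattern of NodeStatusUpdaterImpl, AbstractService.changeState(), and CompositeService's service registry.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal stand-in for AbstractService: tracks a lifecycle state.
abstract class Service {
    enum State { STARTED, STOPPED }
    private State state = State.STARTED;
    // Analogous to AbstractService.changeState(): record the new state.
    protected void changeState(State newState) { state = newState; }
    State getState() { return state; }
    void stop() { changeState(State.STOPPED); }
}

// Minimal stand-in for CompositeService: stopping it stops every
// registered child service as well.
class CompositeServiceSketch extends Service {
    private final List<Service> services = new ArrayList<>();
    void addService(Service s) { services.add(s); }
    @Override
    void stop() {
        // Stop children in reverse registration order, then stop self.
        for (int i = services.size() - 1; i >= 0; i--) {
            services.get(i).stop();
        }
        super.stop();
    }
}

// Minimal stand-in for NodeStatusUpdaterImpl.reboot(): set the
// isRebooted flag, then trigger the stop cascade.
class NodeStatusUpdaterSketch extends CompositeServiceSketch {
    private boolean isRebooted = false;
    void reboot() {
        isRebooted = true;
        stop();
    }
    boolean isRebooted() { return isRebooted; }
}
```

In this model, calling reboot() on the updater is enough to transition every registered child service to STOPPED, which is the same effect observed in the NM and container logs above.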
> NM should act on a REBOOT command from RM
> -----------------------------------------
>
> Key: MAPREDUCE-3034
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-3034
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: mrv2, nodemanager
> Affects Versions: 0.23.0, 0.24.0
> Reporter: Vinod Kumar Vavilapalli
> Assignee: Devaraj K
> Priority: Critical
> Attachments: MAPREDUCE-3034-1.patch, MAPREDUCE-3034-2.patch,
> MAPREDUCE-3034-3.patch, MAPREDUCE-3034.patch, MR-3034.txt
>
>
> RM sends a reboot command to NM in some cases, like when it gets lost and
> rejoins. In such a case, NM should act on the command and
> reboot/reinitialize itself.
> This is akin to TT reinitialize on order from JT. We will need to shutdown
> all the services properly and reinitialize - this should automatically take
> care of killing of containers, cleaning up local temporary files etc.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators:
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira