[
https://issues.apache.org/jira/browse/MAPREDUCE-3034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13204001#comment-13204001
]
Eric Payne commented on MAPREDUCE-3034:
---------------------------------------
@Arun,
I'm pretty sure that the NodeStatusUpdaterImpl.stop() hierarchy already stops
the AppMaster and Containers on the NM via the AsyncDispatcher event process. I
verified this by examining the code, running tests, and examining the logs.
# Verified by examining the code:
** When the reboot command comes from the RM to the NM,
NodeStatusUpdaterImpl.reboot() sets the isRebooted flag and calls
NodeStatusUpdaterImpl.stop().
** NodeStatusUpdaterImpl.stop() (eventually) calls both
AbstractService.changeState() and CompositeService.stop(int
numOfServicesStarted). These methods loop through the list of services
registered with them and stop each one.
# Verified by running tests:
** With this change implemented, I started a long-running mapred job and then
stopped and restarted the RM.
** During the interval between stopping and restarting the RM, I took a
snapshot of the running Java processes.
** Also, during the interval between stopping and restarting the RM, I searched
the NM and container logs for messages from the AsyncDispatcher to determine if
any services were stopped. None were.
** After restarting the RM, I took another snapshot of the Java processes.
Comparing the two snapshots showed that before the RM came back up, the
long-running mapred job was still alive, with both the MRAppMaster process and
the container's YarnChild process running. After the RM started again, the
MRAppMaster and YarnChild processes were gone.
# Verified by examining logs:
** After running the above test, I searched the NM and container logs again
and found several services that had been stopped via the AsyncDispatcher event
process. Of particular interest, the ones from the container {{syslog}} file
were these:
*** JobHistoryEventHandler
*** ContainerLauncherImpl
*** MRAppMaster$ContainerLauncherRouter
*** RMCommunicator
*** MRAppMaster$ContainerAllocatorRouter
*** MRClientService
*** TaskCleaner
*** TaskHeartbeatHandler
*** TaskAttemptListenerImpl
*** Dispatcher
*** MRAppMaster
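
The stop cascade described in the code-examination steps above can be sketched as follows. This is a simplified, self-contained model, not the actual Hadoop classes: {{NodeStatusUpdaterSketch}}, {{Service}}, and the child list here only mimic the pattern of NodeStatusUpdaterImpl, AbstractService.changeState(), and CompositeService's service registry.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal stand-in for AbstractService: tracks a lifecycle state.
abstract class Service {
    enum State { STARTED, STOPPED }
    private State state = State.STARTED;
    // Analogous to AbstractService.changeState(): record the new state.
    protected void changeState(State newState) { state = newState; }
    State getState() { return state; }
    void stop() { changeState(State.STOPPED); }
}

// Minimal stand-in for CompositeService: stopping it stops every
// registered child service as well.
class CompositeServiceSketch extends Service {
    private final List<Service> services = new ArrayList<>();
    void addService(Service s) { services.add(s); }
    @Override
    void stop() {
        // Stop children in reverse registration order, then stop self.
        for (int i = services.size() - 1; i >= 0; i--) {
            services.get(i).stop();
        }
        super.stop();
    }
}

// Minimal stand-in for NodeStatusUpdaterImpl.reboot(): set the
// isRebooted flag, then trigger the stop cascade.
class NodeStatusUpdaterSketch extends CompositeServiceSketch {
    private boolean isRebooted = false;
    void reboot() {
        isRebooted = true;
        stop();
    }
    boolean isRebooted() { return isRebooted; }
}
```

In this model, calling reboot() on the updater is enough to transition every registered child service to STOPPED, which is the same effect observed in the NM and container logs above.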
> NM should act on a REBOOT command from RM
> -----------------------------------------
>
> Key: MAPREDUCE-3034
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-3034
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: mrv2, nodemanager
> Affects Versions: 0.23.0, 0.24.0
> Reporter: Vinod Kumar Vavilapalli
> Assignee: Devaraj K
> Priority: Critical
> Attachments: MAPREDUCE-3034-1.patch, MAPREDUCE-3034-2.patch,
> MAPREDUCE-3034-3.patch, MAPREDUCE-3034.patch, MR-3034.txt
>
>
> RM sends a reboot command to NM in some cases, like when it gets lost and
> rejoins. In such a case, NM should act on the command and
> reboot/reinitialize itself.
> This is akin to TT reinitialize on order from JT. We will need to shutdown
> all the services properly and reinitialize - this should automatically take
> care of killing of containers, cleaning up local temporary files etc.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators:
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira