[jira] [Updated] (MAPREDUCE-3034) NM should act on a REBOOT command from RM

Devaraj K (Updated) (JIRA) Wed, 01 Feb 2012 04:07:24 -0800

     [ 
https://issues.apache.org/jira/browse/MAPREDUCE-3034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Devaraj K updated MAPREDUCE-3034:
---------------------------------

    Attachment: MAPREDUCE-3034-1.patch

{quote}
* Major:
** {{NodeManager.stateChanged()}}:
*** In the case of {{STATE.STOPPED}} *both* {{stop()}} and {{reboot()}} need to 
be called.
*** I don't see anywhere in the call hierarchy where {{NodeManager.stop()}} is 
ever called in the reboot case.
*** Following the logic through, it looks like this is the call sequence from 
your patch:
**** {{NodeStatusUpdaterImpl.startStatusUpdater().run()}}
***** calls: {{NodeStatusUpdaterImpl.reboot()}}
***** calls: {{AbstractService.stop()}}
***** calls: {{AbstractService.changeState()}}
***** calls: {{NodeManager.stateChanged()}}
***** Which *does not* call {{NodeManager.stop()}} if 
{{NodeStatusUpdaterImpl.isRebooted}} is set.
{quote}
stop() was calling from reboot method. Anyway now I have moved it to common for 
SHUTDOWN and REBOOT cases.

{quote}

** {{NodeManager}}:
*** In order for the new private static variable {{nodeManager}} to be fully 
effective, you should take out the local {{NodeManager}} declaration within 
{{main()}}. Otherwise, the new {{getNodeManager()}} method won't always return 
the NodeManager instance.
{quote}
It was added only for reboot testability purpose. Now I have refactored the 
code.


{quote}
** Please look into the new findbugs and core test failures.
{quote}
These test failures are not related to this patch.

{quote}
* Minor:
** {{NodeManager.main()}} / {{NodeManager.reboot()}}:
*** Rather than copy the same code from {{main()}} into the new {{reboot()}} 
method, I would create a common method (call it {{commonNMMain}}, for example). 
You will probably need a parameter flag to tell you whether or not this is 
called from {{main()}} or from {{reboot()}}.
{quote}
I have refactored the code accordingly.

{quote}
** {{NodeManager.reboot()}}:
*** You should handle the case where {{nodeManagerShutdownHook}} is {{null}}. 
Even though it should not be {{null}}, if it is {{null}}, you should create a 
new instance of it.
{quote}
There will not be a case where it will be null. This condition is added only 
for test purpose.

{quote}
** {{NodeManager.reboot()}}:
*** Rather than calling {{getConfig()}} when calling {{nodeManager.init()}}, I 
would argue that it would be better to create a new instance of conf so that 
any updates to the NM configs will be picked up. I guess it could be argued the 
other way, too. That is, you could make the case that the grid owner doesn't 
want the NM configs to change just because the RM was restarted. However, I 
think it should be a new NM config because the RM will have a new config.
{quote}
I don't feel it is a good idea to reload the configuration when the NM gets 
rebooted. If the user want to update any property then they should restart the 
NM manually.

{quote}
** {{NodeStatusUpdaterImpl}}:
*** {{YarnRPC rpc;}} is declared but never used.
{quote}
It was my mistake, I was doing some thing and forgot to remove. I have updated 
now.

{quote}
** {{TestNodeStatusUpdater}}:
*** I don't see where it is testing the RM-restart / NM-reboot case.
{quote}
This test case verifies that when NM gets reboot signal, current instance will 
be stopped and new instance will be started successfully.
                
> NM should act on a REBOOT command from RM
> -----------------------------------------
>
>                 Key: MAPREDUCE-3034
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3034
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2, nodemanager
>    Affects Versions: 0.23.0, 0.24.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Devaraj K
>            Priority: Critical
>         Attachments: MAPREDUCE-3034-1.patch, MAPREDUCE-3034.patch, MR-3034.txt
>
>
> RM sends a reboot command to NM in some cases, like when it gets lost and 
> rejoins back. In such a case, NM should act on the command and 
> reboot/reinitalize itself.
> This is akin to TT reinitialize on order from JT. We will need to shutdown 
> all the services properly and reinitialize - this should automatically take 
> care of killing of containers, cleaning up local temporary files etc.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (MAPREDUCE-3034) NM should act on a REBOOT command from RM

Reply via email to