[ https://issues.apache.org/jira/browse/MAPREDUCE-3730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jason Lowe updated MAPREDUCE-3730:
----------------------------------
Attachment: MAPREDUCE-3730.patch
Adding a patch that is a variation of the initial patch in MAPREDUCE-3070.
There is one race condition I can think of that isn't addressed: the case
where the NM registers with the RM just as the node expiration occurs. If
the expiration is processed after the reconnect, the node will still be
marked lost. When the node heartbeats back in, the RM will direct it to
reboot, and the NM will simply shut down.
Once MAPREDUCE-3034 is addressed, the NM will restart from the reboot directive
and the node will recover.
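
To make the ordering problem concrete, here is a minimal sketch of the race. It is an illustrative model only, not the ResourceManager code or the attached patch: the class ExpiryRaceSketch, its methods, and the node id "host1:1234" are all hypothetical names chosen for the example.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of the RM's node bookkeeping; not actual Hadoop code.
class ExpiryRaceSketch {
    enum NodeState { RUNNING, LOST }
    enum HeartbeatAction { NORMAL, REBOOT }

    private final Map<String, NodeState> nodes = new HashMap<>();

    // An NM registers (or re-registers after a restart): node becomes RUNNING.
    void register(String nodeId) {
        nodes.put(nodeId, NodeState.RUNNING);
    }

    // An expiration event is processed. It carries no notion of whether a
    // reconnect happened in the meantime, so it marks the node LOST regardless.
    void expire(String nodeId) {
        if (nodes.containsKey(nodeId)) {
            nodes.put(nodeId, NodeState.LOST);
        }
    }

    // A heartbeat from a node that is not RUNNING is answered with REBOOT.
    HeartbeatAction heartbeat(String nodeId) {
        return nodes.get(nodeId) == NodeState.RUNNING
                ? HeartbeatAction.NORMAL
                : HeartbeatAction.REBOOT;
    }

    NodeState stateOf(String nodeId) {
        return nodes.get(nodeId);
    }

    public static void main(String[] args) {
        ExpiryRaceSketch rm = new ExpiryRaceSketch();
        String node = "host1:1234";  // hypothetical node id

        rm.register(node);  // original registration
        rm.register(node);  // restarted NM reconnects just before expiry
        rm.expire(node);    // expiration is processed after the reconnect

        System.out.println(rm.stateOf(node));    // node is still marked lost
        System.out.println(rm.heartbeat(node));  // RM directs the NM to reboot
    }
}
```

In this model the reconnect succeeds, but the later expiration event overrides it, so the next heartbeat receives the reboot directive.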
> Allow restarted NM to rejoin cluster before RM expires it
> ---------------------------------------------------------
>
> Key: MAPREDUCE-3730
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-3730
> Project: Hadoop Map/Reduce
> Issue Type: Improvement
> Components: mrv2, resourcemanager
> Affects Versions: 0.23.1, 0.24.0
> Reporter: Jason Lowe
> Assignee: Jason Lowe
> Attachments: MAPREDUCE-3730.patch
>
>
> When a node in the RUNNING state (healthy or unhealthy) is rebooted, the
> resourcemanager rejects the nodemanager's registration request as a duplicate
> because it is convinced that the nodemanager is already running on that node.
> It won't allow that node to rejoin the cluster until the node expiration
> time elapses, which is 10min+ by default. We should allow the NM to rejoin
> the cluster if it re-registers within the expiration timeout.
> Note that this problem occurs with NMs that are configured to specific ports.
> If ephemeral ports are used then an NM reboot "works" because the RM thinks
> the NM registration is for a new node. See the discussions in MAPREDUCE-3070
> and MAPREDUCE-3363.
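
The behavior described in the issue can be sketched as follows. This is a simplified model under stated assumptions, not the RM implementation: RegistrationSketch, its Result values, the allowReconnect flag, and the host and port values are hypothetical, and the 600000 ms timeout only approximates the "10min+" default mentioned above. Nodes are keyed by host:port, which is why a restart on a fixed port looks like a duplicate while a restart on a new ephemeral port looks like a new node.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of registration handling; not actual Hadoop code.
class RegistrationSketch {
    enum Result { ACCEPTED_NEW, REJECTED_DUPLICATE, ACCEPTED_RECONNECT }

    // Last time each node id (host:port) was seen, in milliseconds.
    private final Map<String, Long> lastSeenMillis = new HashMap<>();
    private final long expiryMillis;
    private final boolean allowReconnect;

    RegistrationSketch(long expiryMillis, boolean allowReconnect) {
        this.expiryMillis = expiryMillis;
        this.allowReconnect = allowReconnect;
    }

    Result register(String host, int port, long nowMillis) {
        String nodeId = host + ":" + port;
        Long last = lastSeenMillis.get(nodeId);
        boolean stillTracked = last != null && nowMillis - last < expiryMillis;

        if (!stillTracked) {
            // Unknown node id, or the previous instance has already expired.
            lastSeenMillis.put(nodeId, nowMillis);
            return Result.ACCEPTED_NEW;
        }
        if (allowReconnect) {
            // Proposed behavior: accept a re-registration within the timeout.
            lastSeenMillis.put(nodeId, nowMillis);
            return Result.ACCEPTED_RECONNECT;
        }
        // Behavior reported in the issue: treated as a duplicate until expiry.
        return Result.REJECTED_DUPLICATE;
    }

    public static void main(String[] args) {
        RegistrationSketch current = new RegistrationSketch(600_000L, false);
        System.out.println(current.register("host1", 1234, 0L));        // first registration
        System.out.println(current.register("host1", 1234, 60_000L));   // restart, fixed port
        System.out.println(current.register("host1", 5678, 60_000L));   // restart, new port

        RegistrationSketch proposed = new RegistrationSketch(600_000L, true);
        proposed.register("host1", 1234, 0L);
        System.out.println(proposed.register("host1", 1234, 60_000L));  // restart, fixed port
    }
}
```

With allowReconnect set to false, the fixed-port restart is rejected until the timeout elapses, while the restart on a different port is accepted as a new node. With it set to true, the fixed-port restart is accepted as a reconnect.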