Allow restarted NM to rejoin cluster before RM expires it
---------------------------------------------------------

                 Key: MAPREDUCE-3730
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3730
             Project: Hadoop Map/Reduce
          Issue Type: Improvement
          Components: mrv2, resourcemanager
    Affects Versions: 0.23.1, 0.24.0
            Reporter: Jason Lowe
            Assignee: Jason Lowe


When a node in the RUNNING state (healthy or unhealthy) is rebooted, the 
resourcemanager rejects the restarted nodemanager's registration request as a 
duplicate because it believes the nodemanager is still running on that node.  
It won't allow the node to rejoin the cluster until the node expiration time 
elapses, which is 10+ minutes by default.  We should allow the NM to rejoin 
the cluster if it re-registers within the expiration timeout.

Note that this problem occurs with NMs that are configured to use specific 
ports.  If ephemeral ports are used, then an NM reboot "works" because the RM 
thinks the NM registration is for a new node.  See the discussions in 
MAPREDUCE-3070 and MAPREDUCE-3363.
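
To make the two behaviors concrete, here is a minimal sketch in Java. It is not the ResourceManager code: the class NodeRegistry, its Result values, the allowRejoin flag, and the host:port strings are all hypothetical names chosen for illustration, and heartbeats are left out so that only registration times matter. It assumes a node is identified by its host:port pair, which would explain why an ephemeral port makes a restarted NM look like a new node.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch only; these names are not taken from the ResourceManager source. */
class NodeRegistry {
    enum Result { ACCEPTED, REJOINED, REJECTED_DUPLICATE }

    private final long expiryMillis;
    private final boolean allowRejoin;
    // Time of the last accepted registration, keyed by "host:port" (an assumption).
    private final Map<String, Long> lastSeen = new HashMap<>();

    NodeRegistry(long expiryMillis, boolean allowRejoin) {
        this.expiryMillis = expiryMillis;
        this.allowRejoin = allowRejoin;
    }

    Result register(String nodeId, long nowMillis) {
        Long previous = lastSeen.get(nodeId);
        // The node is still tracked if it registered less than expiryMillis ago.
        boolean stillTracked = previous != null && nowMillis - previous < expiryMillis;
        if (stillTracked && !allowRejoin) {
            // Behavior described in the report: the stale entry blocks the restarted NM.
            return Result.REJECTED_DUPLICATE;
        }
        // Proposed behavior: replace the stale entry instead of rejecting.
        lastSeen.put(nodeId, nowMillis);
        return stillTracked ? Result.REJOINED : Result.ACCEPTED;
    }
}
```

With allowRejoin set to false, a second registration from the same host:port inside the expiry window is rejected, while a registration from the same host on a different port is accepted as a new node. With allowRejoin set to true, the second registration is treated as a rejoin.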



        
