[
https://issues.apache.org/jira/browse/MAPREDUCE-3070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13132369#comment-13132369
]
Devaraj K commented on MAPREDUCE-3070:
--------------------------------------
Thanks Arun and Kamesh for taking a look at the patch.
bq. I think this can be simplified. We don't need 'reconnected' state.
It can be simplified; I will update the patch with a simplified approach.
{quote}
Essentially an NM should be identified with host+port (see NodeId.hashCode).
Now on registration we can assume that host+port is unique - now the question
is: why isn't this already working?
{quote}
{code:title=ResourceTrackerService.java|borderStyle=solid}
if (this.rmContext.getRMNodes().putIfAbsent(nodeId, rmNode) != null) {
  throw new IOException("Duplicate registration from the node!");
}
{code}
If the node manager goes down, it is removed from
this.rmContext.getRMNodes() only after the expiry interval
(YarnConfiguration.RM_NM_EXPIRY_INTERVAL_MS) has elapsed. If the same node
manager comes back up on the same port before the expiry interval completes,
the RM throws an IOException saying "Duplicate registration from the node!"
and the NM fails to start for that reason.
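To illustrate the behaviour described above, here is a minimal, self-contained sketch. It is not the actual ResourceTrackerService code: the class name, the "host:port" string key, the "RUNNING" value and the port numbers are all hypothetical, and the expiry is triggered by hand rather than by a timer.

```java
import java.io.IOException;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical stand-in for the RM-side bookkeeping; not the real ResourceTrackerService.
class DuplicateRegistrationSketch {
    // Node map keyed by "host:port", playing the role of rmContext.getRMNodes().
    private final ConcurrentMap<String, String> nodes = new ConcurrentHashMap<>();

    // Same shape as the putIfAbsent check quoted above.
    void register(String hostPort) throws IOException {
        if (nodes.putIfAbsent(hostPort, "RUNNING") != null) {
            throw new IOException("Duplicate registration from the node!");
        }
    }

    // Stand-in for the removal that happens once the expiry interval has elapsed.
    void expire(String hostPort) {
        nodes.remove(hostPort);
    }

    // Convenience wrapper: true if the registration was accepted, false if rejected.
    boolean tryRegister(String hostPort) {
        try {
            register(hostPort);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        DuplicateRegistrationSketch rm = new DuplicateRegistrationSketch();
        System.out.println(rm.tryRegister("host1:45454")); // first start
        System.out.println(rm.tryRegister("host1:45454")); // restart before expiry, same port
        System.out.println(rm.tryRegister("host1:45455")); // restart on a different port
        rm.expire("host1:45454");                          // expiry interval elapses
        System.out.println(rm.tryRegister("host1:45454")); // restart after expiry
    }
}
```

In this sketch only the second call is rejected: the stale entry for the same host and port is still in the map, so putIfAbsent returns the old value and the registration is treated as a duplicate.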
bq. But I agree with Kamesh's observation on MAPREDUCE-3178, we need to fix
that as he pointed out.
It can be handled; I will handle it in the next patch.
bq. But this should already work if the NM comes up on a different port?
Yes, it works fine.
> NM not able to register with RM after NM restart
> ------------------------------------------------
>
> Key: MAPREDUCE-3070
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-3070
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: mrv2, nodemanager
> Affects Versions: 0.23.0
> Reporter: Ravi Teja Ch N V
> Assignee: Devaraj K
> Priority: Blocker
> Fix For: 0.23.0
>
> Attachments: MAPREDUCE-3070.patch
>
>
> After stopping the NM gracefully and then starting it again, NM registration
> with the RM fails with a "Duplicate registration from the node!" error.
> {noformat}
> 2011-09-23 01:50:46,705 FATAL nodemanager.NodeManager (NodeManager.java:main(204)) - Error starting NodeManager
> org.apache.hadoop.yarn.YarnException: Failed to Start org.apache.hadoop.yarn.server.nodemanager.NodeManager
>     at org.apache.hadoop.yarn.service.CompositeService.start(CompositeService.java:78)
>     at org.apache.hadoop.yarn.server.nodemanager.NodeManager.start(NodeManager.java:153)
>     at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:202)
> Caused by: org.apache.avro.AvroRuntimeException: org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl: Duplicate registration from the node!
>     at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.start(NodeStatusUpdaterImpl.java:141)
>     at org.apache.hadoop.yarn.service.CompositeService.start(CompositeService.java:68)
>     ... 2 more
> Caused by: org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl: Duplicate registration from the node!
>     at org.apache.hadoop.yarn.ipc.ProtoOverHadoopRpcEngine$Invoker.invoke(ProtoOverHadoopRpcEngine.java:142)
>     at $Proxy13.registerNodeManager(Unknown Source)
>     at org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.registerNodeManager(ResourceTrackerPBClientImpl.java:59)
>     at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.registerWithRM(NodeStatusUpdaterImpl.java:175)
>     at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.start(NodeStatusUpdaterImpl.java:137)
>     ... 3 more
> {noformat}