[
https://issues.apache.org/jira/browse/HADOOP-1255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12488484
]
Hairong Kuang commented on HADOOP-1255:
---------------------------------------
After much investigation, I was able to reproduce the problem. It is caused
by the same datanode registering more than once. Each registration puts the
DatanodeDescriptor in the heartbeat queue. When the heartbeat queue holds more
than one reference to the same DatanodeDescriptor and the datanode loses a
heartbeat, heartbeatCheck gets into an infinite loop.
This problem could be fixed either by doing a contains check before adding a
DatanodeDescriptor to the heartbeat queue or by using a collection type that
disallows duplicate entries for the heartbeat queue.
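The two options could look roughly like the following Java sketch. This is
not the actual FSNamesystem code: Node is a hypothetical stand-in for
DatanodeDescriptor, and the class and method names are made up for
illustration. Equality here is by object identity, which matches the
"same instance" duplicates reported in the issue.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class HeartbeatQueueSketch {

    /** Hypothetical stand-in for DatanodeDescriptor (not the real class). */
    static final class Node {
        final String name;
        boolean isAlive = true;

        Node(String name) {
            this.name = name;
        }
    }

    /**
     * Option 1: keep a list-based queue, but check for the node before adding.
     * Returns true only if the node was actually added.
     */
    static boolean registerWithContainsCheck(List<Node> heartbeats, Node node) {
        if (heartbeats.contains(node)) {
            return false; // already queued, so a repeated registration is a no-op
        }
        heartbeats.add(node);
        return true;
    }

    /**
     * Option 2: use a collection that rejects duplicates by itself.
     * Set.add returns false when the element is already present.
     */
    static boolean registerInSet(Set<Node> heartbeats, Node node) {
        return heartbeats.add(node);
    }

    public static void main(String[] args) {
        Node dn = new Node("0.0.0.0:50077");

        // The same datanode registers twice; only one reference is kept.
        List<Node> list = new ArrayList<>();
        registerWithContainsCheck(list, dn);
        registerWithContainsCheck(list, dn);
        System.out.println(list.size()); // prints 1

        // LinkedHashSet keeps insertion order and drops the duplicate.
        Set<Node> set = new LinkedHashSet<>();
        registerInSet(set, dn);
        registerInSet(set, dn);
        System.out.println(set.size()); // prints 1
    }
}
```

With the list, the contains check costs a linear scan on every registration.
The set avoids that scan, but any code that relies on list-style access to
the queue would have to change.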
> Name-node falls into infinite loop trying to remove a dead node.
> ----------------------------------------------------------------
>
> Key: HADOOP-1255
> URL: https://issues.apache.org/jira/browse/HADOOP-1255
> Project: Hadoop
> Issue Type: Bug
> Components: dfs
> Affects Versions: 0.12.3
> Reporter: Konstantin Shvachko
> Fix For: 0.13.0
>
>
> Under certain conditions the name-node falls into an infinite loop in
> heartbeatCheck().
> It's rather hard to reproduce. I'm running a one-node cluster: 1 name-node, 1
> data-node.
> The data-node dies, and 10 minutes later I get
> 07/04/12 10:40:34 INFO net.NetworkTopology: Removing a node:
> /default-rack/0.0.0.0:50077
> 07/04/12 10:44:35 INFO dfs.StateChange: BLOCK* NameSystem.heartbeatCheck:
> lost heartbeat from 0.0.0.0:50077
> ...................................................
> 07/04/12 10:45:17 INFO net.NetworkTopology: Removing a node:
> /default-rack/0.0.0.0:50077
> 07/04/12 10:47:44 INFO dfs.StateChange: BLOCK* NameSystem.heartbeatCheck:
> lost heartbeat from 0.0.0.0:50077
> Here is what I see in the debugger:
> FSNamesystem.heartbeats contains 2 identical (same instance)
> DatanodeDescriptor entries, both with
> DatanodeDescriptor.isAlive = false. heartbeatCheck() correctly detects
> that there is a dead node in
> the list, but removeDatanode() does not delete the node from heartbeats
> because it is already marked dead.