Name-node falls into infinite loop trying to remove a dead node.
----------------------------------------------------------------

                 Key: HADOOP-1255
                 URL: https://issues.apache.org/jira/browse/HADOOP-1255
             Project: Hadoop
          Issue Type: Bug
          Components: dfs
    Affects Versions: 0.12.3
            Reporter: Konstantin Shvachko
             Fix For: 0.13.0


Under certain conditions the name-node falls into an infinite loop in
heartbeatCheck().
It's rather hard to reproduce. I'm running a one-node cluster: 1 name-node and 1
data-node.
The data-node dies, and 10 minutes later I get:

07/04/12 10:40:34 INFO net.NetworkTopology: Removing a node: /default-rack/0.0.0.0:50077
07/04/12 10:44:35 INFO dfs.StateChange: BLOCK* NameSystem.heartbeatCheck: lost heartbeat from 0.0.0.0:50077
...................................................
07/04/12 10:45:17 INFO net.NetworkTopology: Removing a node: /default-rack/0.0.0.0:50077
07/04/12 10:47:44 INFO dfs.StateChange: BLOCK* NameSystem.heartbeatCheck: lost heartbeat from 0.0.0.0:50077

Here is what I see in the debugger:
FSNamesystem.heartbeats contains 2 identical entries (the same
DatanodeDescriptor instance twice), and both have
DatanodeDescriptor.isAlive = false.
heartbeatCheck() correctly detects that there is a dead node in the list, but
removeDatanode() does not delete the node from heartbeats because the node is
already marked dead, so the same node is detected again on the next pass.
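
The failure mode described above can be reproduced in a small, self-contained
model. This is only a sketch and not the actual FSNamesystem code: the class
Node, the methods removeGuarded(), removeAlways() and passesUntilClean(), and
the pass limit are all hypothetical names introduced for illustration, and
dead-node detection is reduced to checking the isAlive flag. The unconditional
removal is one possible way out, not necessarily the fix applied in 0.13.0.

```java
import java.util.ArrayList;
import java.util.List;

public class HeartbeatLoopSketch {
    /** Hypothetical stand-in for DatanodeDescriptor; only isAlive is modeled. */
    static class Node {
        boolean isAlive;
        Node(boolean isAlive) { this.isAlive = isAlive; }
    }

    /** Guarded removal (assumed behavior): a node already marked dead is skipped. */
    static void removeGuarded(List<Node> heartbeats, Node node) {
        if (node.isAlive) {
            heartbeats.remove(node);
            node.isAlive = false;
        }
    }

    /** Unconditional removal: drops every entry referring to this instance. */
    static void removeAlways(List<Node> heartbeats, Node node) {
        heartbeats.removeIf(n -> n == node);
        node.isAlive = false;
    }

    /**
     * Repeats the check until no dead node is left in the list.
     * Returns the number of passes used; a result equal to maxPasses
     * means the list never became clean (the real loop would spin forever).
     */
    static int passesUntilClean(boolean guarded, int maxPasses) {
        Node dead = new Node(false);
        List<Node> heartbeats = new ArrayList<>();
        heartbeats.add(dead);
        heartbeats.add(dead); // same instance twice, as seen in the debugger

        int passes = 0;
        while (passes < maxPasses) {
            Node found = null;
            for (Node n : heartbeats) {
                if (!n.isAlive) { found = n; break; }
            }
            if (found == null) {
                break; // nothing dead remains in the list
            }
            passes++;
            if (guarded) {
                removeGuarded(heartbeats, found);
            } else {
                removeAlways(heartbeats, found);
            }
        }
        return passes;
    }

    public static void main(String[] args) {
        System.out.println("guarded removal passes: " + passesUntilClean(true, 5));
        System.out.println("unconditional removal passes: " + passesUntilClean(false, 5));
    }
}
```

With the guarded removal the model exhausts its pass limit because the dead
entry is never taken out of the list; with the unconditional removal it
finishes after a single pass.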


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
