Xiao Chen created HADOOP-15317:
----------------------------------

             Summary: Improve NetworkTopology chooseRandom's loop
                 Key: HADOOP-15317
                 URL: https://issues.apache.org/jira/browse/HADOOP-15317
             Project: Hadoop Common
          Issue Type: Bug
            Reporter: Xiao Chen
            Assignee: Xiao Chen


Recently we did a postmortem on a case where the ANN (active NameNode) appeared 
to be stuck in an infinite loop. From the logs, it had just gone through a 
rolling restart, and DNs were still registering.

Later the NN became unresponsive, and the stack trace shows it inside a 
do-while loop in {{NetworkTopology#chooseRandom}}, part of the changes made in 
HDFS-10320.

Going through the code and logs, I have not been able to come up with a theory 
for why this is happening. I considered incorrect locking and the Node object 
being modified outside of NetworkTopology, but both seem impossible. 
Regardless, we should eliminate this loop.
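
To illustrate the general idea of removing an unbounded retry loop, here is a minimal sketch. It is not the {{NetworkTopology}} code and not a proposed patch: the class {{BoundedChooser}}, its method signature, and the use of plain strings for nodes are all assumptions made for the example. Instead of drawing random nodes until one falls outside the excluded set, it filters the candidates first and draws once, so the method always terminates and returns null when nothing is eligible.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.Set;

/** Hypothetical illustration only; not the Hadoop NetworkTopology implementation. */
final class BoundedChooser {

  private BoundedChooser() {
  }

  /**
   * Picks one random candidate that is not in the excluded set.
   * Does a single pass over the candidates and a single random draw,
   * so it cannot spin even if the caller's bookkeeping is inconsistent.
   *
   * @return a non-excluded candidate, or null if none is available
   */
  static String chooseRandom(List<String> candidates, Set<String> excluded,
      Random rng) {
    // Collect eligible nodes up front instead of retrying random draws.
    List<String> eligible = new ArrayList<>();
    for (String candidate : candidates) {
      if (!excluded.contains(candidate)) {
        eligible.add(candidate);
      }
    }
    if (eligible.isEmpty()) {
      return null; // nothing to choose; report it rather than loop
    }
    return eligible.get(rng.nextInt(eligible.size()));
  }
}
```

The trade-off in this sketch is an O(n) pass per call in exchange for guaranteed termination. Whether that cost is acceptable on the NN's block placement path is a separate question this example does not answer.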

stacktrace:
{noformat}
 Stack:
java.util.HashMap.hash(HashMap.java:338)
java.util.HashMap.containsKey(HashMap.java:595)
java.util.HashSet.contains(HashSet.java:203)
org.apache.hadoop.net.NetworkTopology.chooseRandom(NetworkTopology.java:786)
org.apache.hadoop.net.NetworkTopology.chooseRandom(NetworkTopology.java:732)
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseDataNode(BlockPlacementPolicyDefault.java:757)
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRandom(BlockPlacementPolicyDefault.java:692)
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRandom(BlockPlacementPolicyDefault.java:666)
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseLocalRack(BlockPlacementPolicyDefault.java:573)
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTargetInOrder(BlockPlacementPolicyDefault.java:461)
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:368)
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:243)
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:115)
org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4AdditionalDatanode(BlockManager.java:1596)
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalDatanode(FSNamesystem.java:3599)
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getAdditionalDatanode(NameNodeRpcServer.java:717)
{noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
