Xiao Chen created HADOOP-15317:
----------------------------------
Summary: Improve NetworkTopology chooseRandom's loop
Key: HADOOP-15317
URL: https://issues.apache.org/jira/browse/HADOOP-15317
Project: Hadoop Common
Issue Type: Bug
Reporter: Xiao Chen
Assignee: Xiao Chen
We recently looked at a postmortem case where the active NameNode (ANN)
appeared to be stuck in an infinite loop. The logs suggest it had just gone
through a rolling restart, and DataNodes (DNs) were still registering.
Later the NN became unresponsive, and the stack trace shows it inside a
do-while loop in {{NetworkTopology#chooseRandom}}, which is part of what was
done in HDFS-10320.
Going through the code and logs, I have not been able to come up with a theory
for why this is happening. I considered incorrect locking, and the Node object
being modified outside of NetworkTopology, but both seem impossible. Either
way, we should eliminate this loop.
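To illustrate the general idea (this is only a sketch, not the actual NetworkTopology code and not necessarily the fix that will be applied), one way to avoid a retry loop is to filter out the excluded nodes first and then pick an index once. The class name {{BoundedChooseRandom}}, the method signature, and the use of plain strings for nodes are all hypothetical and chosen only to keep the example self-contained.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;

// Hypothetical, simplified stand-in; not the real NetworkTopology API.
public class BoundedChooseRandom {

    // Picks a random leaf that is not excluded.
    // Returns null when no eligible leaf exists, instead of retrying.
    static String chooseRandom(List<String> leaves, Set<String> excluded, Random rng) {
        // Single pass over the leaves: cost is bounded by leaves.size().
        List<String> candidates = new ArrayList<>();
        for (String leaf : leaves) {
            if (!excluded.contains(leaf)) {
                candidates.add(leaf);
            }
        }
        if (candidates.isEmpty()) {
            return null; // nothing to choose from; the caller decides what to do
        }
        // One random draw over eligible nodes only, so no rejection loop is needed.
        return candidates.get(rng.nextInt(candidates.size()));
    }

    public static void main(String[] args) {
        List<String> leaves = Arrays.asList("dn1", "dn2", "dn3");
        Set<String> excluded = new HashSet<>(Arrays.asList("dn1", "dn3"));
        Random rng = new Random();

        // Only dn2 is eligible here.
        System.out.println(chooseRandom(leaves, excluded, rng));

        // With every leaf excluded, the method returns null rather than spinning.
        excluded.add("dn2");
        System.out.println(chooseRandom(leaves, excluded, rng));
    }
}
```

The trade-off is an O(n) pass per call in place of a cheap random draw that is retried on a miss. In exchange, termination no longer depends on the count of available nodes being consistent with the excluded set.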
Stack trace:
{noformat}
Stack:
java.util.HashMap.hash(HashMap.java:338)
java.util.HashMap.containsKey(HashMap.java:595)
java.util.HashSet.contains(HashSet.java:203)
org.apache.hadoop.net.NetworkTopology.chooseRandom(NetworkTopology.java:786)
org.apache.hadoop.net.NetworkTopology.chooseRandom(NetworkTopology.java:732)
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseDataNode(BlockPlacementPolicyDefault.java:757)
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRandom(BlockPlacementPolicyDefault.java:692)
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRandom(BlockPlacementPolicyDefault.java:666)
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseLocalRack(BlockPlacementPolicyDefault.java:573)
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTargetInOrder(BlockPlacementPolicyDefault.java:461)
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:368)
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:243)
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:115)
org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4AdditionalDatanode(BlockManager.java:1596)
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalDatanode(FSNamesystem.java:3599)
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getAdditionalDatanode(NameNodeRpcServer.java:717)
{noformat}