[
https://issues.apache.org/jira/browse/HADOOP-15317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16405479#comment-16405479
]
Ajay Kumar commented on HADOOP-15317:
-------------------------------------
[~xiaochen], thanks for updating the patch. I think we can address the best-case
slowness of #4, while still keeping equal probability among the available nodes,
by first checking whether a random pick over all available leaves is excluded or
not:
{code}
// Fast path: pick one leaf uniformly at random among all leaves under the
// parent; only fall back to the slower path when that leaf is excluded.
int numOfLeaves = parentNode.getNumOfLeaves();
if (numOfLeaves <= 0) {
  return null;
}
int nthValidToReturn = r.nextInt(numOfLeaves);
LOG.debug("nthValidToReturn is {}", nthValidToReturn);
Node ret = parentNode.getLeaf(nthValidToReturn, excludedScopeNode);
if (ret != null && !excludedNodes.contains(ret)) {
  return ret;
}
// Fallback: re-draw the index over the available (non-excluded) nodes only.
Node lastValidNode = null;
nthValidToReturn = r.nextInt(availableNodes);
{code}
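To flesh the fallback out, it could then walk the leaves and return the nth
non-excluded one, roughly along these lines (just a sketch reusing the names from
the snippet above, e.g. {{getLeaf}}, {{excludedScopeNode}} and {{excludedNodes}};
not meant as the exact patch code):
{code}
// Sketch only: return the nth valid (non-excluded) leaf, where
// n = nthValidToReturn drawn from the number of available nodes above.
for (int i = 0; i < parentNode.getNumOfLeaves(); i++) {
  Node leaf = parentNode.getLeaf(i, excludedScopeNode);
  if (leaf == null || excludedNodes.contains(leaf)) {
    continue; // skip leaves in the excluded scope or explicitly excluded
  }
  lastValidNode = leaf;
  if (nthValidToReturn == 0) {
    break; // this is the nth valid leaf
  }
  nthValidToReturn--;
}
return lastValidNode;
{code}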
A few comments on patch v2:
* L487 {{testChooseRandomInclude1}}: the excluded node dataNodes[7] (in "/d2/r3")
is outside the scope of our search "/d1". Not sure if that is intentional, but I
think we can safely remove it since it is not used in the test flow. Please
correct me if my understanding is wrong.
* L485-L487 {{testChooseRandomInclude1}}, L511-514 {{testChooseRandomInclude2}}:
shall we randomly select the excluded nodes? For example:
{code}
Random r = new Random();
// Pick the excluded nodes at random instead of hard-coding them.
excludedNodes.add(dataNodes[r.nextInt(5)]);
excludedNodes.add(dataNodes[r.nextInt(5)]);
// excludedNodes.add(dataNodes[7]);
Map<Node, Integer> frequency = pickNodesAtRandom(1000, scope, excludedNodes);
// None of the excluded nodes should ever be chosen.
excludedNodes.parallelStream().forEach(node -> {
  assertEquals(node.getName() + " should be excluded", 0,
      frequency.get(node).intValue());
});
{code}
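If we go with random exclusions, it might also be worth seeding the {{Random}}
(or logging the picked indices) so that a failing run stays reproducible; just a
thought, feel free to ignore.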
> Improve NetworkTopology chooseRandom's loop
> -------------------------------------------
>
> Key: HADOOP-15317
> URL: https://issues.apache.org/jira/browse/HADOOP-15317
> Project: Hadoop Common
> Issue Type: Bug
> Reporter: Xiao Chen
> Assignee: Xiao Chen
> Priority: Major
> Attachments: HADOOP-15317.01.patch, HADOOP-15317.02.patch
>
>
> Recently we found a postmortem case where the ANN seems to be in an infinite
> loop. From the logs it seems it just went through a rolling restart, and DNs
> are getting registered.
> Later the NN became unresponsive, and from the stacktrace it's inside a
> do-while loop inside {{NetworkTopology#chooseRandom}} - part of what's done
> in HDFS-10320.
> Going through the code and logs I'm not able to come up with any theory as to
> why this is happening (I thought about incorrect locking, or the Node object
> being modified outside of NetworkTopology; both seem impossible), but we
> should eliminate this loop.
> stacktrace:
> {noformat}
> Stack:
> java.util.HashMap.hash(HashMap.java:338)
> java.util.HashMap.containsKey(HashMap.java:595)
> java.util.HashSet.contains(HashSet.java:203)
> org.apache.hadoop.net.NetworkTopology.chooseRandom(NetworkTopology.java:786)
> org.apache.hadoop.net.NetworkTopology.chooseRandom(NetworkTopology.java:732)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseDataNode(BlockPlacementPolicyDefault.java:757)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRandom(BlockPlacementPolicyDefault.java:692)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRandom(BlockPlacementPolicyDefault.java:666)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseLocalRack(BlockPlacementPolicyDefault.java:573)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTargetInOrder(BlockPlacementPolicyDefault.java:461)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:368)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:243)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:115)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4AdditionalDatanode(BlockManager.java:1596)
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalDatanode(FSNamesystem.java:3599)
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getAdditionalDatanode(NameNodeRpcServer.java:717)
> {noformat}