[ 
https://issues.apache.org/jira/browse/HADOOP-15317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16416333#comment-16416333
 ] 

Xiao Chen commented on HADOOP-15317:
------------------------------------

Thanks for looking at the test failures, definitely related.

[^HADOOP-15317.04.patch] to improve the first random choose logic - we need 
this random to be from all nodes, not available nodes, so it does not have 
correlation with the nthValidToReturn. (i.e. getLeaf(nthValidToReturn) won't 
provide even distribution).


 Also tuned the test logging / assertion message, in a way that it's obvious 
which node failed the assertion. It should be straightforward to search in the 
logs and find out what the exclude nodes and results are.

> Improve NetworkTopology chooseRandom's loop
> -------------------------------------------
>
>                 Key: HADOOP-15317
>                 URL: https://issues.apache.org/jira/browse/HADOOP-15317
>             Project: Hadoop Common
>          Issue Type: Bug
>            Reporter: Xiao Chen
>            Assignee: Xiao Chen
>            Priority: Major
>         Attachments: HADOOP-15317.01.patch, HADOOP-15317.02.patch, 
> HADOOP-15317.03.patch, HADOOP-15317.04.patch
>
>
> Recently we found a postmortem case where the ANN seems to be in an infinite 
> loop. From the logs it seems it just went through a rolling restart, and DNs 
> are getting registered.
> Later the NN become unresponsive, and from the stacktrace it's inside a 
> do-while loop inside {{NetworkTopology#chooseRandom}} - part of what's done 
> in HDFS-10320.
> Going through the code and logs I'm not able to come up with any theory 
> (thought about incorrect locking, or the Node object being modified outside 
> of NetworkTopology, both seem impossible) why this is happening, but we 
> should eliminate this loop.
> stacktrace:
> {noformat}
>  Stack:
> java.util.HashMap.hash(HashMap.java:338)
> java.util.HashMap.containsKey(HashMap.java:595)
> java.util.HashSet.contains(HashSet.java:203)
> org.apache.hadoop.net.NetworkTopology.chooseRandom(NetworkTopology.java:786)
> org.apache.hadoop.net.NetworkTopology.chooseRandom(NetworkTopology.java:732)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseDataNode(BlockPlacementPolicyDefault.java:757)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRandom(BlockPlacementPolicyDefault.java:692)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRandom(BlockPlacementPolicyDefault.java:666)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseLocalRack(BlockPlacementPolicyDefault.java:573)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTargetInOrder(BlockPlacementPolicyDefault.java:461)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:368)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:243)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:115)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4AdditionalDatanode(BlockManager.java:1596)
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalDatanode(FSNamesystem.java:3599)
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getAdditionalDatanode(NameNodeRpcServer.java:717)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org

Reply via email to