[
https://issues.apache.org/jira/browse/HADOOP-15317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16405479#comment-16405479
]
Ajay Kumar commented on HADOOP-15317:
-------------------------------------
[~xiaochen], thanks for updating the patch. I think we can address the best-case
slowness of #4, while still keeping equal probability among the available nodes,
by first checking whether a random pick over all available leaves is excluded or
not:
{code}
// Fast path: pick one leaf uniformly at random among all leaves under the
// parent; only fall back to the slower path when that leaf is excluded.
int numOfLeaves = parentNode.getNumOfLeaves();
if (numOfLeaves <= 0) {
  return null;
}
int nthValidToReturn = r.nextInt(numOfLeaves);
LOG.debug("nthValidToReturn is {}", nthValidToReturn);
Node ret = parentNode.getLeaf(nthValidToReturn, excludedScopeNode);
if (ret != null && !excludedNodes.contains(ret)) {
  return ret;
}
// Fallback: re-draw the index over the available (non-excluded) nodes only.
Node lastValidNode = null;
nthValidToReturn = r.nextInt(availableNodes);
{code}
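To flesh the fallback out, it could then walk the leaves and return the nth
non-excluded one, roughly along these lines (just a sketch reusing the names from
the snippet above, e.g. {{getLeaf}}, {{excludedScopeNode}} and {{excludedNodes}};
not meant as the exact patch code):
{code}
// Sketch only: return the nth valid (non-excluded) leaf, where
// n = nthValidToReturn drawn from the number of available nodes above.
for (int i = 0; i < parentNode.getNumOfLeaves(); i++) {
  Node leaf = parentNode.getLeaf(i, excludedScopeNode);
  if (leaf == null || excludedNodes.contains(leaf)) {
    continue; // skip leaves in the excluded scope or explicitly excluded
  }
  lastValidNode = leaf;
  if (nthValidToReturn == 0) {
    break; // this is the nth valid leaf
  }
  nthValidToReturn--;
}
return lastValidNode;
{code}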
A few comments on patch v2:
* L487 {{testChooseRandomInclude1}}: the excluded node dataNodes[7] (in "/d2/r3")
is outside the scope of our search "/d1". Not sure if that is intentional, but I
think we can safely remove it since it is not used in the test flow. Please
correct me if my understanding is wrong.
* L485-L487 {{testChooseRandomInclude1}}, L511-514 {{testChooseRandomInclude2}}:
shall we randomly select the excluded nodes? For example:
{code}
Random r = new Random();
// Pick the excluded nodes at random instead of hard-coding them.
excludedNodes.add(dataNodes[r.nextInt(5)]);
excludedNodes.add(dataNodes[r.nextInt(5)]);
// excludedNodes.add(dataNodes[7]);
Map<Node, Integer> frequency = pickNodesAtRandom(1000, scope, excludedNodes);
// None of the excluded nodes should ever be chosen.
excludedNodes.parallelStream().forEach(node -> {
  assertEquals(node.getName() + " should be excluded", 0,
      frequency.get(node).intValue());
});
{code}
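If we go with random exclusions, it might also be worth seeding the {{Random}}
(or logging the picked indices) so that a failing run stays reproducible; just a
thought, feel free to ignore.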
> Improve NetworkTopology chooseRandom's loop
> -------------------------------------------
>
> Key: HADOOP-15317
> URL: https://issues.apache.org/jira/browse/HADOOP-15317
> Project: Hadoop Common
> Issue Type: Bug
> Reporter: Xiao Chen
> Assignee: Xiao Chen
> Priority: Major
> Attachments: HADOOP-15317.01.patch, HADOOP-15317.02.patch
>
>
> Recently we found a postmortem case where the ANN seems to be in an infinite
> loop. From the logs it seems it just went through a rolling restart, and DNs
> are getting registered.
> Later the NN became unresponsive, and from the stacktrace it's inside a
> do-while loop inside {{NetworkTopology#chooseRandom}} - part of what's done
> in HDFS-10320.
> Going through the code and logs I'm not able to come up with any theory as to
> why this is happening (I thought about incorrect locking, or the Node object
> being modified outside of NetworkTopology; both seem impossible), but we
> should eliminate this loop.
> stacktrace:
> {noformat}
> Stack:
> java.util.HashMap.hash(HashMap.java:338)
> java.util.HashMap.containsKey(HashMap.java:595)
> java.util.HashSet.contains(HashSet.java:203)
> org.apache.hadoop.net.NetworkTopology.chooseRandom(NetworkTopology.java:786)
> org.apache.hadoop.net.NetworkTopology.chooseRandom(NetworkTopology.java:732)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseDataNode(BlockPlacementPolicyDefault.java:757)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRandom(BlockPlacementPolicyDefault.java:692)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRandom(BlockPlacementPolicyDefault.java:666)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseLocalRack(BlockPlacementPolicyDefault.java:573)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTargetInOrder(BlockPlacementPolicyDefault.java:461)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:368)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:243)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:115)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4AdditionalDatanode(BlockManager.java:1596)
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalDatanode(FSNamesystem.java:3599)
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getAdditionalDatanode(NameNodeRpcServer.java:717)
> {noformat}