[jira] [Work logged] (HDFS-16333) fix balancer bug when transfer an EC block

ASF GitHub Bot (Jira) Tue, 07 Dec 2021 03:58:08 -0800


     [ 
https://issues.apache.org/jira/browse/HDFS-16333?focusedWorklogId=691687&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-691687
 ]


ASF GitHub Bot logged work on HDFS-16333:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 07/Dec/21 11:57
            Start Date: 07/Dec/21 11:57
    Worklog Time Spent: 10m 
      Work Description: liubingxing commented on a change in pull request #3679:
URL: https://github.com/apache/hadoop/pull/3679#discussion_r763923048



##########
File path: 
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/balancer/TestBalancer.java
##########
@@ -1615,6 +1619,92 @@ public void testBalancerWithStripedFile() throws 
Exception {
     NameNodeConnector.setWrite2IdFile(false);
   }
 
+  @Test
+  public void testBalancerWithExcludeListWithStripedFile() throws Exception {
+    Configuration conf = new Configuration();
+    initConfWithStripe(conf);
+    NameNodeConnector.setWrite2IdFile(true);
+    doTestBalancerWithExcludeListWithStripedFile(conf);
+    NameNodeConnector.setWrite2IdFile(false);
+  }
+
+  private void doTestBalancerWithExcludeListWithStripedFile(Configuration 
conf) throws Exception {
+    int numOfDatanodes = dataBlocks + parityBlocks + 3;
+    int numOfRacks = dataBlocks;
+    long capacity = 20 * defaultBlockSize;
+    long[] capacities = new long[numOfDatanodes];
+    Arrays.fill(capacities, capacity);
+    String[] racks = new String[numOfDatanodes];
+    for (int i = 0; i < numOfDatanodes; i++) {
+      racks[i] = "/rack" + (i % numOfRacks);
+    }
+    cluster = new MiniDFSCluster.Builder(conf)
+        .numDataNodes(numOfDatanodes)
+        .racks(racks)
+        .simulatedCapacities(capacities)
+        .build();
+
+    try {
+      cluster.waitActive();
+      client = NameNodeProxies.createProxy(conf, 
cluster.getFileSystem(0).getUri(),
+          ClientProtocol.class).getProxy();
+      client.enableErasureCodingPolicy(
+          StripedFileTestUtil.getDefaultECPolicy().getName());
+      client.setErasureCodingPolicy("/",
+          StripedFileTestUtil.getDefaultECPolicy().getName());
+
+      long totalCapacity = sum(capacities);
+
+      // fill up the cluster with 30% data. It'll be 45% full plus parity.
+      long fileLen = totalCapacity * 3 / 10;
+      long totalUsedSpace = fileLen * (dataBlocks + parityBlocks) / dataBlocks;
+      FileSystem fs = cluster.getFileSystem(0);
+      DFSTestUtil.createFile(fs, filePath, fileLen, (short) 3, r.nextLong());
+
+      // verify locations of striped blocks
+      LocatedBlocks locatedBlocks = client.getBlockLocations(fileName, 0, 
fileLen);
+      StripedFileTestUtil.verifyLocatedStripedBlocks(locatedBlocks, groupSize);
+
+      // get datanode report
+      DatanodeInfo[] datanodeReport = 
client.getDatanodeReport(DatanodeReportType.ALL);
+
+      // add datanode in new rack
+      String newRack = "/rack" + (++numOfRacks);
+      cluster.startDataNodes(conf, 2, true, null,
+          new String[]{newRack, newRack}, null,
+          new long[]{capacity, capacity});
+      totalCapacity += capacity*2;
+      cluster.triggerHeartbeats();
+
+      // add datanode to exclude list
+      Set<String> dnList = new HashSet<>();
+      dnList.add(datanodeReport[0].getHostName());
+      BalancerParameters.Builder pBuilder = new BalancerParameters.Builder();
+      pBuilder.setExcludedNodes(dnList);
+      waitForHeartBeat(totalUsedSpace, totalCapacity, client, cluster);
+
+      // start balancer and check the failed num of moving task
+      Collection<URI> namenodes = DFSUtil.getInternalNsRpcUris(conf);
+      final int run = runBalancer(namenodes, pBuilder.build(), conf, true);
+      if (conf.getInt(
+          DFSConfigKeys.DFS_DATANODE_BALANCE_MAX_NUM_CONCURRENT_MOVES_KEY,
+          DFSConfigKeys.DFS_DATANODE_BALANCE_MAX_NUM_CONCURRENT_MOVES_DEFAULT)
+          == 0) {
+        assertEquals(ExitStatus.NO_MOVE_PROGRESS.getExitCode(), run);
+      } else {
+        assertEquals(ExitStatus.SUCCESS.getExitCode(), run);
+      }
+      waitForHeartBeat(totalUsedSpace, totalCapacity, client, cluster);
+
+      // verify locations of striped blocks
+      locatedBlocks = client.getBlockLocations(fileName, 0, fileLen);
+      StripedFileTestUtil.verifyLocatedStripedBlocks(locatedBlocks, groupSize);

Review comment:
       This is because `waitForBalancer` not waiting for namenode to delete 
extra replicas. 
   I fix the code and check the total block counts before 
`StripedFileTestUtil.verifyLocatedStripedBlocks`




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


Issue Time Tracking
-------------------

    Worklog Id:     (was: 691687)
    Time Spent: 2h 20m  (was: 2h 10m)

> fix balancer bug when transfer an EC block
> ------------------------------------------
>
>                 Key: HDFS-16333
>                 URL: https://issues.apache.org/jira/browse/HDFS-16333
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: balancer &amp; mover
>            Reporter: qinyuren
>            Assignee: qinyuren
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: image-2021-11-18-17-25-13-089.png, 
> image-2021-11-18-17-25-50-556.png, image-2021-11-18-17-28-03-155.png
>
>          Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> We set the EC policy to (6+3) and we also have nodes that were 
> decommissioning when we executed balancer.
> With the balancer running, we find many error logs as follow.
> !image-2021-11-18-17-25-13-089.png|width=858,height=135!
> Node A wants to transfer an EC block to node B, but we found that the block 
> is not on node A. The FSCK command to show the block status as follow
> !image-2021-11-18-17-25-50-556.png|width=607,height=189!
> In the dispatcher. getBlockList function
> !image-2021-11-18-17-28-03-155.png!
>  
> Assume that the location of the an EC block in storageGroupMap look like this
> indices:[0, 1, 2, 3, 4, 5, 6, 7, 8]
> node:[a, b, c, d, e, f, g, h, i]
> after decommission operation, the internal block on indices[1] were 
> decommission to another node.
> indices:[0, 1, 2, 3, 4, 5, 6, 7, 8]
> node:[a, {color:#FF0000}j{color}, c, d, e, f, g, h, i]
> the location of indices[1] change from node {color:#FF0000}b{color} to node 
> {color:#FF0000}j{color}.
>  
> When the balancer get the block location and check it with the location in 
> storageGroupMap.
> If a node is not found in storageGroupMap, it will not be add to block 
> locations.
> In this case, node {color:#FF0000}j {color}will not be added to the block 
> locations, while the indices is not updated.
> Finally, the block location may look like this, 
> indices:[0, 1, 2, 3, 4, 5, 6, 7, 8]
> {color:#FF0000}block.location:[a, c, d, e, f, g, h, i]{color}
> the location of the nodes does not match their indices
>  
> Solution:
> we should update the indices and match with the nodes
> {color:#FF0000}indices:[0, 2, 3, 4, 5, 6, 7, 8]{color}
> {color:#FF0000}block.location:[a, c, d, e, f, g, h, i]{color}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Work logged] (HDFS-16333) fix balancer bug when transfer an EC block

Reply via email to