[ https://issues.apache.org/jira/browse/HDFS-16456?focusedWorklogId=753952&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-753952 ]

ASF GitHub Bot logged work on HDFS-16456:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 07/Apr/22 10:22
            Start Date: 07/Apr/22 10:22
    Worklog Time Spent: 10m 
      Work Description: tasanuma commented on code in PR #4126:
URL: https://github.com/apache/hadoop/pull/4126#discussion_r844954667


##########
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestBlockPlacementPolicyRackFaultTolerant.java:
##########
@@ -167,6 +179,107 @@ private void doTestChooseTargetSpecialCase() throws Exception {
     }
   }
 
+  /**
+   * Verify decommission a dn which is an only node in its rack.
+   */
+  public void testPlacementWithOnlyOneNodeInRackDecommission() throws Exception {

Review Comment:
   Instead of adding `testPlacementWithOnlyOneNodeInRackDecommission()` to
   `testChooseTarget()`, it would be better to make
   `testPlacementWithOnlyOneNodeInRackDecommission()` its own unit test method.
   ```suggestion
     @Test
    public void testPlacementWithOnlyOneNodeInRackDecommission() throws Exception {
   ```



##########
hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/net/NetworkTopology.java:
##########
@@ -101,6 +101,13 @@ protected NetworkTopology init(InnerNode.Factory factory) {
   private int depthOfAllLeaves = -1;
   /** rack counter */
   protected int numOfRacks = 0;
+  /** empty rack map, rackname->nodenumber. */
+  private HashMap<String, Set<String>> emptyRackMap =

Review Comment:
   This is a rack map used to calculate the number of empty racks, so I would
   prefer the name `rackMap`.
   ```suggestion
     private HashMap<String, Set<String>> rackMap =
   ```
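
   For context, a minimal illustration of how such a map can drive an
   empty-rack count (the field and method names here are illustrative, not the
   patch's actual code):
   ```java
   // Track the live nodes per rack; a rack whose set drains to empty is
   // counted as empty rather than being removed from the topology.
   private final Map<String, Set<String>> rackMap = new HashMap<>();
   private int numOfEmptyRacks = 0;

   private void removeNodeFromRack(String rack, String nodeName) {
     Set<String> nodes = rackMap.get(rack);
     if (nodes != null && nodes.remove(nodeName) && nodes.isEmpty()) {
       numOfEmptyRacks++;
     }
   }
   ```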



##########
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestBlockPlacementPolicyRackFaultTolerant.java:
##########
@@ -167,6 +179,107 @@ private void doTestChooseTargetSpecialCase() throws Exception {
     }
   }
 
+  /**
+   * Verify decommission a dn which is an only node in its rack.
+   */
+  public void testPlacementWithOnlyOneNodeInRackDecommission() throws Exception {
+    Configuration conf = new HdfsConfiguration();
+    final String[] racks = {"/RACK0", "/RACK0", "/RACK2", "/RACK3", "/RACK4", "/RACK5", "/RACK2"};
+    final String[] hosts = {"/host0", "/host1", "/host2", "/host3", "/host4", "/host5", "/host6"};
+
+    // enables DFSNetworkTopology
+    conf.setClass(DFSConfigKeys.DFS_BLOCK_REPLICATOR_CLASSNAME_KEY,
+        BlockPlacementPolicyRackFaultTolerant.class,
+        BlockPlacementPolicy.class);
+    conf.setBoolean(DFSConfigKeys.DFS_USE_DFS_NETWORK_TOPOLOGY_KEY, true);
+    conf.setLong(DFSConfigKeys.DFS_BLOCK_SIZE_KEY, DEFAULT_BLOCK_SIZE);
+    conf.setInt(DFSConfigKeys.DFS_BYTES_PER_CHECKSUM_KEY,
+        DEFAULT_BLOCK_SIZE / 2);
+
+    if (cluster != null) {
+      cluster.shutdown();
+    }
+    cluster = new MiniDFSCluster.Builder(conf).numDataNodes(7).racks(racks)
+        .hosts(hosts).build();
+    cluster.waitActive();
+    nameNodeRpc = cluster.getNameNodeRpc();
+    namesystem = cluster.getNamesystem();
+    DistributedFileSystem fs = cluster.getFileSystem();
+    fs.enableErasureCodingPolicy("RS-3-2-1024k");
+    fs.setErasureCodingPolicy(new Path("/"), "RS-3-2-1024k");
+
+    final BlockManager bm = cluster.getNamesystem().getBlockManager();
+    final DatanodeManager dm = bm.getDatanodeManager();
+    assertTrue(dm.getNetworkTopology() instanceof DFSNetworkTopology);
+
+    String clientMachine = "/host4";
+    String clientRack = "/RACK4";
+    String src = "/test";
+
+    final DatanodeManager dnm = namesystem.getBlockManager().getDatanodeManager();
+    DatanodeDescriptor dnd4 = dnm.getDatanode(cluster.getDataNodes().get(4).getDatanodeId());
+    assertEquals(dnd4.getNetworkLocation(), clientRack);
+    dnm.getDatanodeAdminManager().startDecommission(dnd4);
+    short replication = 5;
+    short additionalReplication = 1;
+
+    try {
+      // Create the file with client machine
+      HdfsFileStatus fileStatus = namesystem.startFile(src, perm,
+          clientMachine, clientMachine, EnumSet.of(CreateFlag.CREATE), true,
+          replication, DEFAULT_BLOCK_SIZE * 1024 * 10, null, null, null, false);
+
+      //test chooseTarget for new file
+      LocatedBlock locatedBlock = nameNodeRpc.addBlock(src, clientMachine,
+          null, null, fileStatus.getFileId(), null, null);
+      HashMap<String, Integer> racksCount = new HashMap<String, Integer>();
+      doTestLocatedBlockRacks(racksCount, replication, 4, locatedBlock);
+
+      //test chooseTarget for existing file.
+      LocatedBlock additionalLocatedBlock =
+          nameNodeRpc.getAdditionalDatanode(src, fileStatus.getFileId(),
+              locatedBlock.getBlock(), locatedBlock.getLocations(),
+              locatedBlock.getStorageIDs(), DatanodeInfo.EMPTY_ARRAY,
+              additionalReplication, clientMachine);
+
+      racksCount.clear();
+      doTestLocatedBlockRacks(racksCount, additionalReplication + replication,
+          4, additionalLocatedBlock);
+      assertEquals(racksCount.get("/RACK0"), (Integer)2);
+      assertEquals(racksCount.get("/RACK2"), (Integer)2);
+    } finally {
+      dnm.getDatanodeAdminManager().stopDecommission(dnd4);
+    }
+
+    //test if decommission successed

Review Comment:
   ```suggestion
       //test if decommission succeeded
   ```



##########
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestBlockPlacementPolicyRackFaultTolerant.java:
##########
@@ -167,6 +179,107 @@ private void doTestChooseTargetSpecialCase() throws Exception {
     }
   }
 
+  /**
+   * Verify decommission a dn which is an only node in its rack.
+   */
+  public void testPlacementWithOnlyOneNodeInRackDecommission() throws Exception {
+    Configuration conf = new HdfsConfiguration();
+    final String[] racks = {"/RACK0", "/RACK0", "/RACK2", "/RACK3", "/RACK4", "/RACK5", "/RACK2"};
+    final String[] hosts = {"/host0", "/host1", "/host2", "/host3", "/host4", "/host5", "/host6"};
+
+    // enables DFSNetworkTopology
+    conf.setClass(DFSConfigKeys.DFS_BLOCK_REPLICATOR_CLASSNAME_KEY,
+        BlockPlacementPolicyRackFaultTolerant.class,
+        BlockPlacementPolicy.class);
+    conf.setBoolean(DFSConfigKeys.DFS_USE_DFS_NETWORK_TOPOLOGY_KEY, true);
+    conf.setLong(DFSConfigKeys.DFS_BLOCK_SIZE_KEY, DEFAULT_BLOCK_SIZE);
+    conf.setInt(DFSConfigKeys.DFS_BYTES_PER_CHECKSUM_KEY,
+        DEFAULT_BLOCK_SIZE / 2);
+
+    if (cluster != null) {
+      cluster.shutdown();
+    }
+    cluster = new MiniDFSCluster.Builder(conf).numDataNodes(7).racks(racks)
+        .hosts(hosts).build();
+    cluster.waitActive();
+    nameNodeRpc = cluster.getNameNodeRpc();
+    namesystem = cluster.getNamesystem();
+    DistributedFileSystem fs = cluster.getFileSystem();
+    fs.enableErasureCodingPolicy("RS-3-2-1024k");
+    fs.setErasureCodingPolicy(new Path("/"), "RS-3-2-1024k");
+
+    final BlockManager bm = cluster.getNamesystem().getBlockManager();
+    final DatanodeManager dm = bm.getDatanodeManager();
+    assertTrue(dm.getNetworkTopology() instanceof DFSNetworkTopology);
+
+    String clientMachine = "/host4";
+    String clientRack = "/RACK4";
+    String src = "/test";
+
+    final DatanodeManager dnm = namesystem.getBlockManager().getDatanodeManager();
+    DatanodeDescriptor dnd4 = dnm.getDatanode(cluster.getDataNodes().get(4).getDatanodeId());
+    assertEquals(dnd4.getNetworkLocation(), clientRack);
+    dnm.getDatanodeAdminManager().startDecommission(dnd4);
+    short replication = 5;
+    short additionalReplication = 1;
+
+    try {
+      // Create the file with client machine
+      HdfsFileStatus fileStatus = namesystem.startFile(src, perm,
+          clientMachine, clientMachine, EnumSet.of(CreateFlag.CREATE), true,
+          replication, DEFAULT_BLOCK_SIZE * 1024 * 10, null, null, null, false);
+
+      //test chooseTarget for new file
+      LocatedBlock locatedBlock = nameNodeRpc.addBlock(src, clientMachine,
+          null, null, fileStatus.getFileId(), null, null);
+      HashMap<String, Integer> racksCount = new HashMap<String, Integer>();
+      doTestLocatedBlockRacks(racksCount, replication, 4, locatedBlock);
+
+      //test chooseTarget for existing file.
+      LocatedBlock additionalLocatedBlock =
+          nameNodeRpc.getAdditionalDatanode(src, fileStatus.getFileId(),
+              locatedBlock.getBlock(), locatedBlock.getLocations(),
+              locatedBlock.getStorageIDs(), DatanodeInfo.EMPTY_ARRAY,
+              additionalReplication, clientMachine);
+
+      racksCount.clear();
+      doTestLocatedBlockRacks(racksCount, additionalReplication + replication,
+          4, additionalLocatedBlock);
+      assertEquals(racksCount.get("/RACK0"), (Integer)2);
+      assertEquals(racksCount.get("/RACK2"), (Integer)2);
+    } finally {
+      dnm.getDatanodeAdminManager().stopDecommission(dnd4);
+    }
+
+    //test if decommission successed
+    DatanodeDescriptor dnd3 = dnm.getDatanode(cluster.getDataNodes().get(3).getDatanodeId());
+    cluster.getNamesystem().writeLock();
+    try {
+      dm.getDatanodeAdminManager().startDecommission(dnd3);
+    } finally {
+      cluster.getNamesystem().writeUnlock();
+    }
+
+    // make sure the decommission finishes and the block in on 6 racks

Review Comment:
   maybe
   ```suggestion
       // make sure the decommission finishes and the block in on 4 racks
   ```



##########
hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/net/NetworkTopology.java:
##########
@@ -1015,4 +1024,108 @@ protected static boolean isNodeInScope(Node node, String scope) {
     String nodeLocation = NodeBase.getPath(node) + NodeBase.PATH_SEPARATOR_STR;
     return nodeLocation.startsWith(scope);
   }
-}
\ No newline at end of file
+
+  /** @return the number of nonempty racks */
+  public int getNumOfNonEmptyRacks() {
+    return numOfRacks - numOfEmptyRacks;
+  }
+
+  /**
+   * Update empty rack number when add a node like recommision.

Review Comment:
   ```suggestion
      * Update empty rack number when add a node like recommission.
   ```



##########
hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/net/NetworkTopology.java:
##########
@@ -1015,4 +1024,108 @@ protected static boolean isNodeInScope(Node node, String scope) {
     String nodeLocation = NodeBase.getPath(node) + NodeBase.PATH_SEPARATOR_STR;
     return nodeLocation.startsWith(scope);
   }
-}
\ No newline at end of file
+
+  /** @return the number of nonempty racks */
+  public int getNumOfNonEmptyRacks() {
+    return numOfRacks - numOfEmptyRacks;
+  }
+
+  /**
+   * Update empty rack number when add a node like recommision.
+   * @param node node to be added; can be null
+   */
+  public void recommissionNode(Node node) {
+    if (node == null) {
+      return;
+    }
+    if (node instanceof InnerNode) {
+      throw new IllegalArgumentException(
+          "Not allow to remove an inner node: " + NodeBase.getPath(node));
+    }
+    netlock.writeLock().lock();
+    try {
+      decommissionNodes.remove(node.getName());
+      interAddNodeWithEmptyRack(node);
+    } finally {
+      netlock.writeLock().unlock();
+    }
+  }
+
+  /**
+   * Update empty rack number when remove a node like decommision.

Review Comment:
   ```suggestion
      * Update empty rack number when remove a node like decommission.
   ```





Issue Time Tracking
-------------------

    Worklog Id:     (was: 753952)
    Time Spent: 1h  (was: 50m)

> EC: Decommission a rack with only one dn will fail when the rack number is 
> equal to the replication
> ------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-16456
>                 URL: https://issues.apache.org/jira/browse/HDFS-16456
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: ec, namenode
>    Affects Versions: 3.4.0
>            Reporter: caozhiqiang
>            Assignee: caozhiqiang
>            Priority: Critical
>              Labels: pull-request-available
>         Attachments: HDFS-16456.001.patch, HDFS-16456.002.patch, 
> HDFS-16456.003.patch, HDFS-16456.004.patch, HDFS-16456.005.patch, 
> HDFS-16456.006.patch, HDFS-16456.007.patch, HDFS-16456.008.patch, 
> HDFS-16456.009.patch, HDFS-16456.010.patch
>
>          Time Spent: 1h
>  Remaining Estimate: 0h
>
> In the scenario below, decommission will fail with the TOO_MANY_NODES_ON_RACK reason:
>  # Enable an EC policy, such as RS-6-3-1024k.
>  # The number of racks in the cluster is equal to or less than the replication 
> number (9).
>  # A rack has only one DN, and this DN is decommissioned.
> The root cause is in the 
> BlockPlacementPolicyRackFaultTolerant::getMaxNodesPerRack() function: it 
> computes a limit parameter, maxNodesPerRack, for choosing targets. In this 
> scenario, maxNodesPerRack is 1, which means at most one datanode can be 
> chosen from each rack.
> {code:java}
>   protected int[] getMaxNodesPerRack(int numOfChosen, int numOfReplicas) {
>    ...
>     // If more replicas than racks, evenly spread the replicas.
>     // This calculation rounds up.
>     int maxNodesPerRack = (totalNumOfReplicas - 1) / numOfRacks + 1;
>     return new int[] {numOfReplicas, maxNodesPerRack};
>   } {code}
> Here the line maxNodesPerRack = (totalNumOfReplicas - 1) / numOfRacks + 1 is 
> evaluated with totalNumOfReplicas = 9 and numOfRacks = 9, giving 
> maxNodesPerRack = (9 - 1) / 9 + 1 = 1.
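>
> As a quick worked check of that arithmetic (a standalone snippet, not code 
> from the patch):
> {code:java}
> // RS-6-3 EC policy: 6 data + 3 parity = 9 replicas, on a 9-rack cluster.
> int totalNumOfReplicas = 9;
> int numOfRacks = 9;
> // Integer division truncates, so the "+ 1" implements the round-up:
> // (9 - 1) / 9 + 1 == 0 + 1 == 1.
> int maxNodesPerRack = (totalNumOfReplicas - 1) / numOfRacks + 1;
> {code}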
> When we decommission a dn that is the only node in its rack, the chooseOnce() 
> call in BlockPlacementPolicyRackFaultTolerant::chooseTargetInOrder() throws 
> NotEnoughReplicasException, but the exception is not caught, so execution 
> never falls back to the chooseEvenlyFromRemainingRacks() function.
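>
> A minimal, hypothetical sketch of that missing fallback (the method 
> signatures are approximations, not the patch's exact code):
> {code:java}
> try {
>   // First placement attempt; today a NotEnoughReplicasException thrown
>   // here propagates instead of triggering the fallback.
>   chooseOnce(numOfReplicas, writer, excludedNodes, blocksize,
>       maxNodesPerRack, results, avoidStaleNodes, storageTypes);
> } catch (NotEnoughReplicasException e) {
>   // Fall back to spreading the remaining replicas over the racks that
>   // still have capacity, as the later placement steps already do.
>   chooseEvenlyFromRemainingRacks(writer, excludedNodes, blocksize,
>       maxNodesPerRack, results, avoidStaleNodes, storageTypes,
>       totalReplicaExpected, e);
> }
> {code}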
> During decommission, after targets are chosen, the verifyBlockPlacement() 
> function returns a total rack count that includes the invalid rack, so 
> BlockPlacementStatusDefault::isPlacementPolicySatisfied() returns false, 
> which also causes the decommission to fail.
> {code:java}
>   public BlockPlacementStatus verifyBlockPlacement(DatanodeInfo[] locs,
>       int numberOfReplicas) {
>     if (locs == null)
>       locs = DatanodeDescriptor.EMPTY_ARRAY;
>     if (!clusterMap.hasClusterEverBeenMultiRack()) {
>       // only one rack
>       return new BlockPlacementStatusDefault(1, 1, 1);
>     }
>     // Count locations on different racks.
>     Set<String> racks = new HashSet<>();
>     for (DatanodeInfo dn : locs) {
>       racks.add(dn.getNetworkLocation());
>     }
>     return new BlockPlacementStatusDefault(racks.size(), numberOfReplicas,
>         clusterMap.getNumOfRacks());
>   } {code}
> {code:java}
>   public boolean isPlacementPolicySatisfied() {
>     return requiredRacks <= currentRacks || currentRacks >= totalRacks;
>   }{code}
> According to the above description, the following modifications are needed to fix it:
>  # In startDecommission() or stopDecommission(), we should also update the 
> numOfRacks in class NetworkTopology. Otherwise, choosing targets may fail 
> because maxNodesPerRack is too small; and even if choosing targets succeeds, 
> isPlacementPolicySatisfied() will still return false and cause the 
> decommission to fail.
>  # In BlockPlacementPolicyRackFaultTolerant::chooseTargetInOrder(), the first 
> chooseOnce() call should also be wrapped in try..catch, or it will not fall 
> back to chooseEvenlyFromRemainingRacks() when an exception is thrown (as 
> sketched above).
>  # In verifyBlockPlacement(), we need to exclude invalid racks from the total 
> numOfRacks, or isPlacementPolicySatisfied() will return false and cause data 
> reconstruction to fail (see the sketch after this list).
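>
> For modification 3, a hedged sketch of how verifyBlockPlacement() could count 
> only non-empty racks, reusing the getNumOfNonEmptyRacks() accessor shown in 
> the NetworkTopology diff above (the exact call site may differ in the final 
> patch):
> {code:java}
> // Compare against the racks that still contain live nodes, so racks
> // emptied by decommission no longer inflate the rack requirement.
> return new BlockPlacementStatusDefault(racks.size(), numberOfReplicas,
>     clusterMap.getNumOfNonEmptyRacks());
> {code}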
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
