siddhantsangwan commented on code in PR #6367:
URL: https://github.com/apache/ozone/pull/6367#discussion_r1533252567
##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/node/NodeDecommissionManager.java:
##########
@@ -368,6 +382,50 @@ public synchronized void startDecommission(DatanodeDetails dn)
     }
   }
+  private synchronized boolean checkIfDecommissionPossible(List<DatanodeDetails> dns, List<DatanodeAdminError> errors) {
+    // do we require method synchronization?
+    int minInService = -1; // maxRatis = -1, maxEc = -1;
+    for (DatanodeDetails dn : dns) {
Review Comment:
We also need to check that the Datanodes being decommissioned are initially
in-service and not in any other state. We'd also have to account for any non
in-service DNs, or DNs for which we get a `NodeNotFoundException`, in this
calculation.
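Something like this is what I have in mind (just a sketch on my part, not
code from this PR; it assumes the existing `NodeManager#getNodeStatus` and
`NodeStatus#isInService` APIs):
```java
// Sketch only: filter out DNs that are not in service or unknown to SCM
// before they enter the minInService calculation.
for (DatanodeDetails dn : dns) {
  try {
    NodeStatus status = nodeManager.getNodeStatus(dn);
    if (!status.isInService()) {
      errors.add(new DatanodeAdminError(dn.getHostName(),
          "The host is not in service"));
      continue; // skip: must not be counted as decommissionable
    }
  } catch (NodeNotFoundException ex) {
    errors.add(new DatanodeAdminError(dn.getHostName(),
        "The host was not found in SCM"));
    continue; // skip unknown DNs as well
  }
  // ... existing per-container checks for this DN ...
}
```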
##########
hadoop-hdds/server-scm/src/test/java/org/apache/hadoop/hdds/scm/node/TestNodeDecommissionManager.java:
##########
@@ -56,17 +75,50 @@ public class TestNodeDecommissionManager {
   private NodeDecommissionManager decom;
   private StorageContainerManager scm;
   private NodeManager nodeManager;
+  private ContainerManager containerManager;
   private OzoneConfiguration conf;
+  @TempDir
+  private File testDir;
+  private DBStore dbStore;
+  private SCMHAManager scmhaManager;
+  private SequenceIdGenerator sequenceIdGen;
+  private ContainerReplicaPendingOps pendingOpsMock;
   @BeforeEach
   void setup(@TempDir File dir) throws Exception {
     conf = new OzoneConfiguration();
     conf.set(HddsConfigKeys.OZONE_METADATA_DIRS, dir.getAbsolutePath());
-    nodeManager = createNodeManager(conf);
-    decom = new NodeDecommissionManager(conf, nodeManager,
+    scm = HddsTestUtils.getScm(conf);
+    nodeManager = scm.getScmNodeManager();
+    final OzoneConfiguration ozConf = SCMTestUtils.getConf(testDir);
+    dbStore = DBStoreBuilder.createDBStore(
+        ozConf, new SCMDBDefinition());
+    scmhaManager = SCMHAManagerStub.getInstance(true);
+    sequenceIdGen = new SequenceIdGenerator(
+        ozConf, scmhaManager, SCMDBDefinition.SEQUENCE_ID.getTable(dbStore));
+    final PipelineManager pipelineManager =
+        new MockPipelineManager(dbStore, scmhaManager, nodeManager);
+    pipelineManager.createPipeline(RatisReplicationConfig.getInstance(
+        HddsProtos.ReplicationFactor.THREE));
+    pendingOpsMock = mock(ContainerReplicaPendingOps.class);
+    containerManager = new ContainerManagerImpl(ozConf,
Review Comment:
I haven't checked the test cases yet, but is it necessary to use the actual
container manager implementation here? Could we mock it instead? If we mock
it, this test won't need to depend on the actual implementation.
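As a rough illustration of what I mean (my own sketch, reusing the Mockito
`mock`/`when` helpers this test already uses; the exact
`ContainerInfo.Builder` setters are assumptions, not code from the PR):
```java
// Hypothetical sketch: stub the ContainerManager interface rather than
// constructing ContainerManagerImpl, so the test depends only on the API.
containerManager = mock(ContainerManager.class);
ContainerInfo container = new ContainerInfo.Builder()
    .setContainerID(1)
    .setState(HddsProtos.LifeCycleState.CLOSED)
    .setReplicationConfig(RatisReplicationConfig.getInstance(
        HddsProtos.ReplicationFactor.THREE))
    .build();
// Return the same container for any ID; individual tests can re-stub.
when(containerManager.getContainer(any(ContainerID.class)))
    .thenReturn(container);
```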
##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/node/NodeDecommissionManager.java:
##########
@@ -368,6 +382,50 @@ public synchronized void startDecommission(DatanodeDetails dn)
     }
   }
+  private synchronized boolean checkIfDecommissionPossible(List<DatanodeDetails> dns, List<DatanodeAdminError> errors) {
+    // do we require method synchronization?
+    int minInService = -1; // maxRatis = -1, maxEc = -1;
+    for (DatanodeDetails dn : dns) {
+      Set<ContainerID> containers;
+      try {
+        containers = nodeManager.getContainers(dn);
+      } catch (NodeNotFoundException ex) {
+        LOG.warn("The host {} was not found in SCM. Ignoring the request to " +
+            "decommission it", dn.getHostName());
+        errors.add(new DatanodeAdminError(dn.getHostName(),
+            "The host was not found in SCM"));
+        continue; // ignore the DN and continue to next one
+      }
+      for (ContainerID cid : containers) {
+        ContainerInfo cif;
+        try {
+          cif = containerManager.getContainer(cid);
+        } catch (ContainerNotFoundException ex) {
+          continue; // ignore the container and continue to next one
+        }
+        if (cif.getState().equals(HddsProtos.LifeCycleState.DELETED) ||
+            cif.getState().equals(HddsProtos.LifeCycleState.DELETING)) {
+          continue;
+        }
+        int reqNodes = cif.getReplicationConfig().getRequiredNodes();
+        if (reqNodes > minInService) {
+          minInService = reqNodes;
+        }
+        /* The code below would check the replication type and then get the
+           factor, but since we have a simpler way, i.e. getRequiredNodes(),
+           I don't think we need to care about the replication type.
+
+        HddsProtos.ReplicationType replicationType = cif.getReplicationType();
+        if (replicationType.equals(HddsProtos.ReplicationType.RATIS)) {
+          maxRatis = cif.getReplicationFactor().getNumber();
+        } else if (replicationType.equals(HddsProtos.ReplicationType.EC)) {
+          //cif.getReplicationConfig();
+        } */
Review Comment:
Yes, `getRequiredNodes` is what we want.
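For reference, a quick illustration (my own sketch; `ECReplicationConfig(3, 2)`
is just an example config, not one from this PR): `getRequiredNodes()` already
folds the replication type into a single node count, so no type switch is
needed.
```java
// getRequiredNodes() gives the node count for any replication type.
ReplicationConfig ratis = RatisReplicationConfig.getInstance(
    HddsProtos.ReplicationFactor.THREE);
ratis.getRequiredNodes(); // 3

ReplicationConfig ec = new ECReplicationConfig(3, 2); // example EC config
ec.getRequiredNodes();    // 5 = 3 data + 2 parity
```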
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]