sodonnel commented on a change in pull request #881:
URL: https://github.com/apache/hadoop-ozone/pull/881#discussion_r420691343



##########
File path: hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/ReplicationManager.java
##########
@@ -512,25 +523,60 @@ private void handleUnderReplicatedContainer(final ContainerInfo container,
       if (source.size() > 0) {
         final int replicationFactor = container
             .getReplicationFactor().getNumber();
-        final int delta = replicationFactor - getReplicaCount(id, replicas);
+        // Want to check if the container is mis-replicated after considering
+        // inflight add and delete.
+        // Create a new list from source (healthy replicas minus pending delete)
+        List<DatanodeDetails> targetReplicas = new ArrayList<>(source);
+        // Then add any pending additions
+        targetReplicas.addAll(replicationInFlight);
+
+        int delta = replicationFactor - getReplicaCount(id, replicas);
+        final ContainerPlacementStatus placementStatus =
+            containerPlacement.validateContainerPlacement(
+                targetReplicas, replicationFactor);
+        final int misRepDelta = placementStatus.additionalReplicaRequired();
+        final int replicasNeeded
+            = delta < misRepDelta ? misRepDelta : delta;
+
         final List<DatanodeDetails> excludeList = replicas.stream()
             .map(ContainerReplica::getDatanodeDetails)
             .collect(Collectors.toList());
+        excludeList.addAll(replicationInFlight);
         List<InflightAction> actionList = inflightReplication.get(id);
         if (actionList != null) {
           actionList.stream().map(r -> r.datanode)
               .forEach(excludeList::add);
         }
         final List<DatanodeDetails> selectedDatanodes = containerPlacement
-            .chooseDatanodes(excludeList, null, delta,
+            .chooseDatanodes(excludeList, null, replicasNeeded,
                 container.getUsedBytes());
-
-        LOG.info("Container {} is under replicated. Expected replica count" +
-                " is {}, but found {}.", id, replicationFactor,
-            replicationFactor - delta);
-
-        for (DatanodeDetails datanode : selectedDatanodes) {
-          sendReplicateCommand(container, datanode, source);
+        if (delta > 0) {
+          LOG.info("Container {} is under replicated. Expected replica count" +
+                  " is {}, but found {}.", id, replicationFactor,
+              replicationFactor - delta);
+        }
+        int newMisRepDelta = misRepDelta;
+        if (misRepDelta > 0) {
+          LOG.info("Container: {}. {}",
+              id, placementStatus.misReplicatedReason());
+          // Check if the new target nodes (original plus newly selected nodes)
+          // makes the placement policy valid.
+          targetReplicas.addAll(selectedDatanodes);
+          newMisRepDelta = containerPlacement.validateContainerPlacement(
+              targetReplicas, replicationFactor).additionalReplicaRequired();

Review comment:
   The nodemanager + topology can easily tell us the total racks ever registered since SCM was started. But if you have a case where the cluster had 2 racks and then a rack has stopped, that is not so easy to see. You would need to filter out all the non-healthy nodes, which is basically what the placement policy does. Then it falls back to giving you a node that does not satisfy placement if it needs to. It is important that it does this, as it needs to handle under-replication too via the same chooseDatanodes path.
   
   The whole area around the fallback needs further thought in general. If we fall to 1 rack on a cluster that had 2 racks, then it's safe to assume the other rack should come back, and until it does the containers are mis-replicated (and probably under-replicated too) and should be reported as such.
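   
   For reference, the way the two deltas combine in the new code boils down to something like this standalone sketch (not the real ReplicationManager method; the names here are made up):
   
   ```java
   public class ReplicasNeededSketch {
     // Combine the under-replication and mis-replication deltas into a single
     // number of replicas to request, mirroring the logic in the diff above.
     static int replicasNeeded(int replicationFactor, int healthyReplicaCount,
         int additionalReplicasForPlacement) {
       // Shortfall against the replication factor (under-replication).
       int delta = replicationFactor - healthyReplicaCount;
       // Extra replicas the placement policy asks for (mis-replication).
       int misRepDelta = additionalReplicasForPlacement;
       // Request whichever is larger so both problems are handled in one pass.
       return Math.max(delta, misRepDelta);
     }
   
     public static void main(String[] args) {
       // 3 healthy replicas, but all on one rack and the policy wants one more
       // rack: delta = 0, misRepDelta = 1, so 1 replica is still scheduled.
       System.out.println(replicasNeeded(3, 3, 1)); // prints 1
     }
   }
   ```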
   
   In both cases a warning would be written to the logs stating:
   
   ```
   LOG.warn("Container {} is mis-replicated, requiring {} additional " +
                 "replicas. After selecting new nodes, mis-replication has not 
" +
                 "improved. No additional replicas will be scheduled",
                 id, misRepDelta);
   ```
   This is really an edge case (a 2-rack cluster going down to 1 rack) and I think we should leave this as it is for now, as there are surely other things that need to be thought about here in the wider network topology logic.



