sodonnel commented on code in PR #4006:
URL: https://github.com/apache/ozone/pull/4006#discussion_r1041054745


##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/SCMCommonPlacementPolicy.java:
##########
@@ -426,4 +451,67 @@ public boolean isValidNode(DatanodeDetails datanodeDetails,
     }
     return false;
   }
+
+  /**
+   * Given a set of replicas of a container which are
+   * neither over underreplicated nor overreplicated,
+   * return a set of replicas to copy to another node to fix misreplication.
+   * @param replicas
+   */
+  @Override
+  public Set<ContainerReplica> replicasToCopyToFixMisreplication(
+         Set<ContainerReplica> replicas) {
+    Map<Node, List<ContainerReplica>> placementGroupReplicaIdMap
+            = replicas.stream().collect(Collectors.groupingBy(replica ->
+            this.getPlacementGroup(replica.getDatanodeDetails())));
+
+    int totalNumberOfReplicas = replicas.size();
+    int requiredNumberOfPlacementGroups =
+            getRequiredRackCount(totalNumberOfReplicas);
+    int additionalNumberOfRacksRequired = Math.max(
+            requiredNumberOfPlacementGroups - 
placementGroupReplicaIdMap.size(),
+            0);
+    int replicasPerPlacementGroup =
+            getMaxReplicasPerRack(totalNumberOfReplicas);
+    Set<ContainerReplica> copyReplicaSet = Sets.newHashSet();
+
+    for (List<ContainerReplica> replicaList: placementGroupReplicaIdMap
+            .values()) {
+      if (replicaList.size() > replicasPerPlacementGroup) {
+        List<ContainerReplica> replicasToBeCopied = replicaList.stream()
+                .limit(replicaList.size() - replicasPerPlacementGroup)
+                .collect(Collectors.toList());
+        copyReplicaSet.addAll(replicasToBeCopied);
+        replicaList.removeAll(replicasToBeCopied);
+      }
+    }
+    if (additionalNumberOfRacksRequired > copyReplicaSet.size()) {

Review Comment:
   These are the optimal distributions for the 3-2 and 6-3:
   
   ```
   3-2
   
   2 Racks = 3, 2
   3 Racks = 2, 2, 1
   4 racks = 2, 1, 1, 1
   
   
   6-3
   
   2 Racks = 5, 4
   3 Racks = 3, 3, 3
   4 Racks = 3, 3, 2, 1
   5 Racks = 2, 2, 2, 2, 1
   6 Racks = 2, 2, 2, 1, 1, 1
   7 Racks = 2, 2, 1, 1, 1, 1, 1
   8 Racks = 2, 1, 1, 1, 1, 1, 1, 1
   ```
   
   
   If you always start from the largest rack and address it first, and then 
decend down the racks, you take away at most the ideal replicas per rack from 
the previous level and you reduce the racks needed by 1. As you divide and 
round up, you are always allocating the ideal number to that rack. If there are 
no excess replicas above that number, we copy nothing and move onto the next 
rack.
   
   I have tried to think through scenarios where this might now work, but I 
have not been able to come up with any.
   
   I also don't think we would copy more than we need to, as we are always 
simply pruning the largest rack to reduce its number to the most it can be 
allowed to have, while maintaining the replicas per rack we expect.
   
   There may also be some ways of exiting early from the checks. Eg if we know 
we have X racks and we need X + 1, as soon as we find one replica to copy we 
will have increased the racks. However, I am not sure about it, as we may have 
a non-ideal distribution.
   
   I also think my while loop should be `while (orderedRacks.size() > 0) {` so 
it always checks all racks, as if we had 6-3 with only 2 racks now, so 5 and 4 
replicas each, and we want to get to 4 racks, we need to copy 2 from the first 
and 1 from the second to get to 3, 3, 2, 1 per rack.
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to