fapifta commented on a change in pull request #881:
URL: https://github.com/apache/hadoop-ozone/pull/881#discussion_r420070985



##########
File path: 
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/SCMCommonPlacementPolicy.java
##########
@@ -198,4 +211,69 @@ public boolean hasEnoughSpace(DatanodeDetails 
datanodeDetails,
    */
   public abstract DatanodeDetails chooseNode(
       List<DatanodeDetails> healthyNodes);
+
+  /**
+   * Default implementation for basic placement policies that do not have a
+   * network topology. If the policy has no network topology this method
+   * should return null.
+   * @return The networkTopology for the policy or null if none is configured.
+   */
+  protected NetworkTopology getNetworkTopology() {
+    return null;
+  }
+
+  /**
+   * Default implementation to return the number of racks containers should
+   * span to meet the placement policy. For simple policies that are not rack
+   * aware, this default implementation returns 1.
+   * @return The number of racks containers should span to meet the policy.
+   */
+  protected int getRequiredRackCount() {
+    return 1;
+  }
+
+  /**
+   * This default implementation handles both rack aware and non rack aware
+   * policies. If a future placement policy needs to check more than just
+   * racks to validate the policy (eg node groups, HDFS-like upgrade domain)
+   * this method should be overridden in the sub class.
+   * This method requires that subclasses which implement rack aware policies
+   * override the default methods getRequiredRackCount and getNetworkTopology.
+   * @param dns List of datanodes holding a replica of the container
+   * @param replicas The expected number of replicas
+   * @return ContainerPlacementStatus indicating if the placement policy is
+   *         met or not. Note this only considers the rack count and not the
+   *         number of replicas.
+   */
+  @Override
+  public ContainerPlacementStatus validateContainerPlacement(
+      List<DatanodeDetails> dns, int replicas) {
+    NetworkTopology topology = getNetworkTopology();
+    int requiredRacks = getRequiredRackCount();
+    if (topology == null || replicas == 1 || requiredRacks == 1) {
+      if (dns.size() > 0) {
+        // placement is always satisfied if there is at least one DN.
+        return validPlacement;
+      } else {
+        return invalidPlacement;
+      }
+    }

Review comment:
       This part of the method clearly deals with the non-topology-aware case
and returns, while the rest deals with the topology-aware case. Why don't we
use polymorphism here to separate the two? I understand that this requires
copying the topology-aware logic to two places, but that is another problem we
have to deal with at some point.
   So I believe this if branch should be the default implementation in this
class, and the rest should live inside the rack aware and pipeline placement
policies.
   
   This would also render getRequiredRackCount unnecessary: here we could
simply drop the expression, which always yields true for non-topology-aware
policies, while the topology aware policies can keep and use the constant in
their own implementations.
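
   Roughly this is what I mean (a sketch only; the rack-aware override and the
getRackOf() helper are illustrative, assuming the surrounding classes already
have the validPlacement/invalidPlacement fields and a topology reference):

```java
// SCMCommonPlacementPolicy: default for every non-topology-aware policy.
@Override
public ContainerPlacementStatus validateContainerPlacement(
    List<DatanodeDetails> dns, int replicas) {
  // Placement is satisfied as long as at least one replica exists.
  return dns.isEmpty() ? invalidPlacement : validPlacement;
}

// Rack aware / pipeline placement policy: override with the topology logic;
// the required rack count is just a local constant here.
@Override
public ContainerPlacementStatus validateContainerPlacement(
    List<DatanodeDetails> dns, int replicas) {
  if (dns.isEmpty()) {
    return invalidPlacement;
  }
  final int requiredRacks = Math.min(replicas, REQUIRED_RACK_COUNT);
  final long usedRacks = dns.stream()
      .map(this::getRackOf)   // illustrative helper: datanode -> rack
      .distinct()
      .count();
  return usedRacks >= requiredRacks ? validPlacement : invalidPlacement;
}
```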

##########
File path: 
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/ReplicationManager.java
##########
@@ -393,8 +396,11 @@ private boolean isContainerHealthy(final ContainerInfo 
container,
    */
   private boolean isContainerUnderReplicated(final ContainerInfo container,
       final Set<ContainerReplica> replicas) {
+    boolean misReplicated = !getPlacementStatus(

Review comment:
       I understand that this is here to check whether we have fewer than 2
racks with replicas, and if so, to consider the container under replicated.
   I am unsure though whether we want to ensure that a container is replicated
to exactly 2 racks; it does not seem to be the case in the rack aware placement
policy. But if so, then mis-replication should also be checked somewhere else,
to handle the case when a container ends up replicated to 3 racks, for example
after removing a rack from the config and moving its nodes to other racks.
(Probably this can go into isContainerOverReplicated(), if we want to check for
this case.)
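
   Just to illustrate where I imagine that check, if we do want to enforce an
exact rack count (a hypothetical sketch; the actual/expected placement count
accessors on ContainerPlacementStatus do not exist in this PR and would need
to be added):

```java
// Somewhere in isContainerOverReplicated(): flag a container that spans more
// racks than the policy asks for, so the excess replica can be removed.
final List<DatanodeDetails> dns = replicas.stream()
    .map(ContainerReplica::getDatanodeDetails)
    .collect(Collectors.toList());
final ContainerPlacementStatus status = containerPlacement
    .validateContainerPlacement(dns,
        container.getReplicationFactor().getNumber());
// actualPlacementCount()/expectedPlacementCount() are hypothetical accessors.
final boolean overSpread =
    status.actualPlacementCount() > status.expectedPlacementCount();
```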

##########
File path: 
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/ReplicationManager.java
##########
@@ -512,25 +523,60 @@ private void handleUnderReplicatedContainer(final 
ContainerInfo container,
       if (source.size() > 0) {
         final int replicationFactor = container
             .getReplicationFactor().getNumber();
-        final int delta = replicationFactor - getReplicaCount(id, replicas);
+        // Want to check if the container is mis-replicated after considering
+        // inflight add and delete.
+        // Create a new list from source (healthy replicas minus pending 
delete)
+        List<DatanodeDetails> targetReplicas = new ArrayList<>(source);
+        // Then add any pending additions
+        targetReplicas.addAll(replicationInFlight);
+
+        int delta = replicationFactor - getReplicaCount(id, replicas);
+        final ContainerPlacementStatus placementStatus =
+            containerPlacement.validateContainerPlacement(
+                targetReplicas, replicationFactor);
+        final int misRepDelta = placementStatus.additionalReplicaRequired();

Review comment:
       In reality this method gives back the additional rack count required,
not the additional replicas required.
   This is why we also have to check against the difference between the
replication factor and the replica count here, even though we already pass the
replication factor to the validation method, so it should be able to return
the number of replicas really required.
   
   Let's see the different cases (a sketch of what I would expect
additionalReplicaRequired() to compute follows after the list):
    - non-topology-aware placement policies: if replica count == replication
factor we are good, otherwise the required additional replica count is
replication factor - replica count, and we need that many extra replicas later
from chooseDatanodes().
    - a topology-aware placement policy with replication factor of 1: if it is
under replicated, we have already lost the block.
    - a topology-aware placement policy with replication factor of 2: if it is
under replicated, either we have lost the block, or we have it on one rack,
and chooseDatanodes() will give us the required replica on a different rack.
    - topology-aware placement policies with replication factor of 3 (or more):
      - when we have 1 healthy replica, we need 2 additional replicas anyway;
it is not really interesting on which rack.
      - when we have 2 healthy replicas, we need 1 additional replica, either
on one of the racks that already holds a replica, or on another rack if the
two replicas are on the same rack. The policy already handles this, as it uses
the excluded node list to anchor to the proper racks in chooseDatanodes().
      - when we have 3 healthy replicas but all three are on the same rack,
additionalReplicaRequired should still yield 1, and chooseDatanodes() should
correctly select a new set of DataNodes consisting of the nodes that already
hold the replica plus one additional node on a different rack. This makes the
container over replicated, so one excess replica will be removed later, after
replication has finished.
   
   So if additionalReplicaRequired works as its name suggests, we should not
need to check the replication factor and replica count here, but just rely on
the value given back by the method.
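
   Something along these lines is what I would expect the status object to
return (the field names expectedReplicas, currentReplicas, requiredRacks and
usedRacks are illustrative, not the actual ContainerPlacementStatus
implementation):

```java
// Sketch: additionalReplicaRequired() covering both plain under-replication
// and rack-level mis-replication, so the caller can drop the extra
// max(delta, misRepDelta) logic.
public int additionalReplicaRequired() {
  // Replicas missing purely by count (replication factor not met).
  final int countDeficit = Math.max(0, expectedReplicas - currentReplicas);
  // Replicas that must land on new racks to satisfy the topology; for
  // non-topology-aware policies requiredRacks is 1, so this is 0 whenever
  // at least one replica exists.
  final int rackDeficit = Math.max(0, requiredRacks - usedRacks);
  return Math.max(countDeficit, rackDeficit);
}
```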

##########
File path: 
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/ReplicationManager.java
##########
@@ -512,25 +523,60 @@ private void handleUnderReplicatedContainer(final 
ContainerInfo container,
       if (source.size() > 0) {
         final int replicationFactor = container
             .getReplicationFactor().getNumber();
-        final int delta = replicationFactor - getReplicaCount(id, replicas);
+        // Want to check if the container is mis-replicated after considering
+        // inflight add and delete.
+        // Create a new list from source (healthy replicas minus pending 
delete)
+        List<DatanodeDetails> targetReplicas = new ArrayList<>(source);
+        // Then add any pending additions
+        targetReplicas.addAll(replicationInFlight);
+
+        int delta = replicationFactor - getReplicaCount(id, replicas);
+        final ContainerPlacementStatus placementStatus =
+            containerPlacement.validateContainerPlacement(
+                targetReplicas, replicationFactor);
+        final int misRepDelta = placementStatus.additionalReplicaRequired();
+        final int replicasNeeded
+            = delta < misRepDelta ? misRepDelta : delta;
+
         final List<DatanodeDetails> excludeList = replicas.stream()
             .map(ContainerReplica::getDatanodeDetails)
             .collect(Collectors.toList());
+        excludeList.addAll(replicationInFlight);

Review comment:
       Do we add the in-flight replications twice here this way, or am I
missing something?
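
   If they do overlap, a set-based exclude list would at least keep the
duplicates away from chooseDatanodes() (sketch only, reusing the variables
from this hunk and assuming DatanodeDetails equality is by node identity):

```java
// Collect the exclusions into a Set so a node that both holds a replica and
// is an in-flight replication target is only listed once.
final Set<DatanodeDetails> exclude = new HashSet<>();
replicas.stream()
    .map(ContainerReplica::getDatanodeDetails)
    .forEach(exclude::add);
exclude.addAll(replicationInFlight);
final List<DatanodeDetails> excludeList = new ArrayList<>(exclude);
```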

##########
File path: 
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/ReplicationManager.java
##########
@@ -512,25 +523,60 @@ private void handleUnderReplicatedContainer(final 
ContainerInfo container,
       if (source.size() > 0) {
         final int replicationFactor = container
             .getReplicationFactor().getNumber();
-        final int delta = replicationFactor - getReplicaCount(id, replicas);
+        // Want to check if the container is mis-replicated after considering
+        // inflight add and delete.
+        // Create a new list from source (healthy replicas minus pending 
delete)
+        List<DatanodeDetails> targetReplicas = new ArrayList<>(source);
+        // Then add any pending additions
+        targetReplicas.addAll(replicationInFlight);
+
+        int delta = replicationFactor - getReplicaCount(id, replicas);
+        final ContainerPlacementStatus placementStatus =
+            containerPlacement.validateContainerPlacement(
+                targetReplicas, replicationFactor);
+        final int misRepDelta = placementStatus.additionalReplicaRequired();
+        final int replicasNeeded
+            = delta < misRepDelta ? misRepDelta : delta;
+
         final List<DatanodeDetails> excludeList = replicas.stream()
             .map(ContainerReplica::getDatanodeDetails)
             .collect(Collectors.toList());
+        excludeList.addAll(replicationInFlight);
         List<InflightAction> actionList = inflightReplication.get(id);
         if (actionList != null) {
           actionList.stream().map(r -> r.datanode)
               .forEach(excludeList::add);
         }
         final List<DatanodeDetails> selectedDatanodes = containerPlacement
-            .chooseDatanodes(excludeList, null, delta,
+            .chooseDatanodes(excludeList, null, replicasNeeded,
                 container.getUsedBytes());
-
-        LOG.info("Container {} is under replicated. Expected replica count" +
-                " is {}, but found {}.", id, replicationFactor,
-            replicationFactor - delta);
-
-        for (DatanodeDetails datanode : selectedDatanodes) {
-          sendReplicateCommand(container, datanode, source);
+        if (delta > 0) {
+          LOG.info("Container {} is under replicated. Expected replica count" +
+                  " is {}, but found {}.", id, replicationFactor,
+              replicationFactor - delta);
+        }
+        int newMisRepDelta = misRepDelta;
+        if (misRepDelta > 0) {
+          LOG.info("Container: {}. {}",
+              id, placementStatus.misReplicatedReason());
+          // Check if the new target nodes (original plus newly selected nodes)
+          // makes the placement policy valid.
+          targetReplicas.addAll(selectedDatanodes);
+          newMisRepDelta = containerPlacement.validateContainerPlacement(
+              targetReplicas, replicationFactor).additionalReplicaRequired();

Review comment:
       The rack-aware policy's chooseDatanodes() method ensures, based on the
excluded nodes list, that we get back a proper list, so I believe we do not
need to check it twice. Are you aware of any case where this is not true?
Based on my understanding of the logic there, I would expect all future
placement policies to behave the same way, as the other existing ones
implicitly do so now. If necessary we can add this to the contract of the
interface, though this is an interesting property of the rack-aware policy
that would be worth some re-thinking; that is out of scope for now.
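
   If we ever make that explicit, something like this could go into the
Javadoc of chooseDatanodes() on the placement policy interface (wording only;
the signature below is reproduced from memory and is illustrative, not part of
this PR):

```java
/**
 * ...
 * Implementations must select nodes such that the union of excludedNodes and
 * the returned list satisfies validateContainerPlacement(), i.e. the chosen
 * targets repair any rack-level mis-replication visible from excludedNodes.
 */
List<DatanodeDetails> chooseDatanodes(List<DatanodeDetails> excludedNodes,
    List<DatanodeDetails> favoredNodes, int nodesRequired,
    long sizeRequired) throws SCMException;
```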




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


