[GitHub] [ozone] sodonnel opened a new pull request, #4561: HDDS-8336. ReplicationManager: RatisUnderReplicationHandler should partially recover the container if not enough nodes

via GitHub Wed, 12 Apr 2023 09:34:23 -0700


sodonnel opened a new pull request, #4561:
URL: https://github.com/apache/ozone/pull/4561


   ## What changes were proposed in this pull request?
   
   In `RatisUnderReplicationHandler`, if the container is under-replicated by
   more than one replica, we need to select two or more targets for the new
   copies. The targets are chosen in `getTargets()`:
   
   ```
     private List<DatanodeDetails> getTargets(
         RatisContainerReplicaCount replicaCount,
         List<ContainerReplicaOp> pendingOps) throws IOException {
       // DNs that already have replicas cannot be targets and should be
       // excluded
       final List<DatanodeDetails> excludeList =
           replicaCount.getReplicas().stream()
               .map(ContainerReplica::getDatanodeDetails)
               .collect(Collectors.toList());
   
       // DNs that are already waiting to receive replicas cannot be targets
       final List<DatanodeDetails> pendingReplication =
           pendingOps.stream()
               .filter(containerReplicaOp -> containerReplicaOp.getOpType() ==
                   ContainerReplicaOp.PendingOpType.ADD)
               .map(ContainerReplicaOp::getTarget)
               .collect(Collectors.toList());
       excludeList.addAll(pendingReplication);
   
       /*
       Ensure that target datanodes have enough space to hold a complete
       container.
       */
       final long dataSizeRequired =
           Math.max(replicaCount.getContainer().getUsedBytes(),
               currentContainerSize);
       return placementPolicy.chooseDatanodes(excludeList, null,
           replicaCount.additionalReplicaNeeded(), 0, dataSizeRequired);
     }
   ```
   
   We ask the placement policy for the required number of nodes. If it cannot
   provide that many (e.g. we want 2, but it can only find 1 spare node), it
   throws an SCMException. In that case, we should try again, requesting one
   fewer node, to see if we can partially recover. If that also fails, we keep
   reducing the request by one node until we reach zero.
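
   The retry described above could look roughly like the sketch below. It is
   not the code in this PR: `PlacementException` and `NodeChooser` are
   hypothetical, simplified stand-ins for `SCMException` and the placement
   policy, datanodes are plain strings, and returning an empty list when no
   node can be found is an assumption made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for SCMException. */
class PlacementException extends Exception {
  PlacementException(String message) {
    super(message);
  }
}

/** Hypothetical, simplified stand-in for the placement policy. */
interface NodeChooser {
  List<String> choose(int count) throws PlacementException;
}

class FallbackTargetSelection {
  /**
   * Asks for `required` nodes. If the chooser cannot satisfy the request,
   * retries with one fewer node each time until the request reaches zero.
   * Returns an empty list if no request succeeds (an assumption here).
   */
  static List<String> chooseWithFallback(NodeChooser chooser, int required) {
    for (int count = required; count > 0; count--) {
      try {
        return chooser.choose(count);
      } catch (PlacementException e) {
        // Not enough nodes for `count`; fall through and ask for fewer.
      }
    }
    return new ArrayList<>();
  }

  public static void main(String[] args) {
    // A chooser that only has one spare node available.
    NodeChooser oneSpare = count -> {
      if (count > 1) {
        throw new PlacementException("only 1 node available");
      }
      return List.of("dn-4");
    };
    // We want 2 nodes but settle for the 1 that exists.
    System.out.println(chooseWithFallback(oneSpare, 2)); // prints [dn-4]
  }
}
```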
   
   If we successfully send commands for some but not all of the new copies, we
   should throw an exception at the end so that the container is re-queued on
   the under-replication queue and retried after a short while.
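
   As a rough illustration of that behaviour, and not the code in this PR,
   the sketch below sends a command to every target that was found and only
   then signals that recovery was partial. `PartialRecoveryException` and
   `CommandSender` are hypothetical names, and datanodes are plain strings.

```java
import java.util.List;

/** Hypothetical exception signalling that recovery was only partial. */
class PartialRecoveryException extends Exception {
  PartialRecoveryException(String message) {
    super(message);
  }
}

class PartialRecoverySketch {
  /** Hypothetical hook standing in for sending a replicate command. */
  interface CommandSender {
    void sendReplicateCommand(String target);
  }

  /**
   * Sends a command to every available target first, then throws if fewer
   * targets than required were available, so the caller can re-queue the
   * container. Returns the number of commands sent on full success.
   */
  static int replicate(List<String> targets, int required,
      CommandSender sender) throws PartialRecoveryException {
    for (String target : targets) {
      sender.sendReplicateCommand(target);
    }
    if (targets.size() < required) {
      throw new PartialRecoveryException("sent " + targets.size()
          + " of " + required + " required replicate commands");
    }
    return targets.size();
  }

  public static void main(String[] args) {
    try {
      // Two replicas are needed but only one target was found.
      replicate(List.of("dn-4"), 2,
          target -> System.out.println("replicate to " + target));
    } catch (PartialRecoveryException e) {
      System.out.println("re-queue: " + e.getMessage());
    }
  }
}
```

   Throwing only after the loop means the copies that can be made are still
   started, while the caller learns the container needs another pass.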
   
   Note that `MisReplicationHandler.getTargetDatanodes()` currently has
   similar logic.
   
   ## What is the link to the Apache JIRA?
   
   https://issues.apache.org/jira/browse/HDDS-8336
   
   ## How was this patch tested?
   
   New unit test added
   

