sodonnel commented on code in PR #5592:
URL: https://github.com/apache/ozone/pull/5592#discussion_r1402220654


##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/replication/health/RatisReplicationCheckHandler.java:
##########
@@ -200,9 +213,37 @@ of UNHEALTHY replicas (such as 3 CLOSED and 1 UNHEALTHY 
replicas of a
     RatisContainerReplicaCount consideringUnhealthy =
         new RatisContainerReplicaCount(container, replicas, replicaPendingOps,
             minReplicasForMaintenance, true);
-    boolean isOverReplicated = consideringUnhealthy.isOverReplicated(false);
-    if (isOverReplicated) {
-      return consideringUnhealthy.toOverHealthResult();
+
+    if (consideringUnhealthy.isOverReplicated(false)) {
+      if (container.getState() == HddsProtos.LifeCycleState.CLOSED) {
+        return consideringUnhealthy.toOverHealthResult();
+      } else if (container.getState()
+          == HddsProtos.LifeCycleState.QUASI_CLOSED) {
+        // If the container is quasi-closed and over replicated, we may have a
+        // case where the excess replica is an unhealthy one, but it has a
+        // unique origin and therefore should not be deleted. In this case,
+        // we should not mark the container as over replicated.
+        // We ignore pending deletes, as a container is still over replicated
+        // until the pending delete completes.
+        ContainerReplica toDelete = ReplicationManagerUtil
+            .selectUnhealthyReplicaForDelete(container, replicas, 0,

Review Comment:
   In this part of the code, we know it is over-replicated due to unhealthy, as 
earlier we checked for over-replication without unhealthy, and we would have 
returned there if it was over replicated. So here I am checking if there is 
anything valid we can delete, which must be the unhealthy ones - if there is, 
return over replicated, otherwise return "healthy". I am passing zero to 
`ReplicationManagerUtil.selectUnhealthyReplicaForDelete` to intentionally 
ignore the deletes, so null will not be returned due to the pending deletes.
   
   The caller will decide if the container needs to be queued on the over-rep 
queue when it considers the pending delete via this existing code:
   
   ```
         if (!overHealth.isReplicatedOkAfterPending() &&
             !overHealth.hasMismatchedReplicas()) {
           /*
           A mis matched replica is one whose state does not match the
           container's state and the state is not UNHEALTHY.
           For example, a CLOSED container with 1 CLOSED, 2 CLOSING, and 1
           UNHEALTHY replica has 2 mis matched replicas (the 2 CLOSING ones).
           We want to CLOSE the mis matched replicas first before queuing the
           container for over replication.
            */
           request.getReplicationQueue().enqueue(overHealth); 
   ```
   
   I think we have to ignore the deletes as we still class a container as 
over-replicated until the pending delete completes, but we only queue it if the 
pending delete doesn't fix it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to