siddhantsangwan commented on code in PR #5678:
URL: https://github.com/apache/ozone/pull/5678#discussion_r1410205380


##########
hadoop-hdds/server-scm/src/test/java/org/apache/hadoop/hdds/scm/container/replication/health/TestRatisReplicationCheckHandler.java:
##########
@@ -500,6 +501,43 @@ public void testOverReplicatedContainerWithMismatchedReplicas() {
         ReplicationManagerReport.HealthState.OVER_REPLICATED));
   }
 
+  @Test
+  public void shouldQueueForOverReplicationOnlyWhenSafe() {
+    ContainerInfo container =
+        createContainerInfo(repConfig, 1L, HddsProtos.LifeCycleState.CLOSED);
+    Set<ContainerReplica> replicas = createReplicas(container.containerID(),
+        ContainerReplicaProto.State.CLOSED, 0, 0);
+    ContainerReplica unhealthyReplica =
+        createContainerReplica(container.containerID(), 0, IN_SERVICE,
+            ContainerReplicaProto.State.UNHEALTHY);
+    ContainerReplica mismatchedReplica =
+        createContainerReplica(container.containerID(), 0, IN_SERVICE,
+            ContainerReplicaProto.State.QUASI_CLOSED);
+    replicas.add(mismatchedReplica);
+    replicas.add(unhealthyReplica);
+
+    requestBuilder.setContainerReplicas(replicas)
+        .setContainerInfo(container);
+
+    ContainerHealthResult.OverReplicatedHealthResult
+        result = (ContainerHealthResult.OverReplicatedHealthResult)
+        healthCheck.checkHealth(requestBuilder.build());
+
+    assertEquals(ContainerHealthResult.HealthState.OVER_REPLICATED,
+        result.getHealthState());
+    assertEquals(1, result.getExcessRedundancy());
+    assertFalse(result.isReplicatedOkAfterPending());
+
+    // not safe for over replication because we don't have 3 matching replicas
+    assertFalse(result.isSafelyOverReplicated());
+
+    assertTrue(healthCheck.handle(requestBuilder.build()));
+    assertEquals(0, repQueue.underReplicatedQueueSize());
+    assertEquals(0, repQueue.overReplicatedQueueSize());
+    assertEquals(1, report.getStat(
+        ReplicationManagerReport.HealthState.OVER_REPLICATED));
+  }

Review Comment:
   It would lead to questions, but I think those would be valid questions 
from an observability perspective. We expect the quasi closed replica to get 
closed soon, after which there will be 3 closed replicas and the unhealthy 
one can be deleted. If that doesn't happen for a long time, then there's 
certainly a problem with the cluster - maybe the datanode hosting the quasi 
closed replica has gone stale, the close command isn't reaching the datanode, 
or the datanode is hitting an error while processing the command. If we don't 
report the container as over replicated, then what? How do we surface the 
issue?
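   
   To make the observability angle concrete, here's a rough sketch of how the 
stat could be watched (just an illustration, not the actual ReplicationManager 
code - `report` is the ReplicationManagerReport used in the test above, and 
`LOG` is a hypothetical logger):
   
```java
// Sketch only, not actual Ozone code. Assumes 'report' is a
// ReplicationManagerReport (org.apache.hadoop.hdds.scm.container) and
// 'LOG' is a hypothetical slf4j Logger.
long overReplicated = report.getStat(
    ReplicationManagerReport.HealthState.OVER_REPLICATED);
if (overReplicated > 0) {
  // If this count stays non-zero across many ReplicationManager runs, the
  // close command probably isn't reaching the datanode, or the datanode is
  // failing to process it.
  LOG.warn("{} containers are over replicated and may be stuck",
      overReplicated);
}
```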
   
   Also, I wonder if we should instead go ahead and delete the unhealthy 
replica without waiting for the quasi closed replica to close. I chose not to 
do this because the future improvements we're planning should make it so that 
the unhealthy replica does not block decommissioning. It also seems safer not 
to send any delete commands until we have 3 closed replicas. On the other 
hand, that unhealthy replica continues to take up space on a datanode if 
closing the quasi closed replica is blocked.
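   
   For reference, the guard I mean is roughly the following sketch (the names 
`replicas` and `repConfig` are borrowed from the test above; this is not the 
actual handler logic):
   
```java
// Sketch only: consider deleting the unhealthy replica only once the
// container already has enough matching CLOSED replicas.
long closedCount = replicas.stream()
    .filter(r -> r.getState() == ContainerReplicaProto.State.CLOSED)
    .count();
// For Ratis THREE replication, getRequiredNodes() is 3.
boolean safeToDeleteUnhealthy = closedCount >= repConfig.getRequiredNodes();
```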
   
   What do you think?


