Re: [PR] HDDS-9645. Recon doesn't exclude out-of-service nodes when checking for healthy containers [ozone]

via GitHub Mon, 27 Nov 2023 15:08:54 -0800


xBis7 commented on code in PR #5651:
URL: https://github.com/apache/ozone/pull/5651#discussion_r1406876671



##########
hadoop-ozone/recon/src/main/java/org/apache/hadoop/ozone/recon/fsck/ContainerHealthStatus.java:
##########
@@ -48,8 +48,12 @@ public class ContainerHealthStatus {
     int repFactor = container.getReplicationConfig().getRequiredNodes();
     this.healthyReplicas = healthyReplicas
         .stream()
-        .filter(r -> !r.getState()
-            .equals((ContainerReplicaProto.State.UNHEALTHY)))
+        // Filter unhealthy replicas and
+        // replicas belonging to out-of-service nodes.
+        .filter(r ->
+            (!r.getDatanodeDetails().isDecommissioned() &&
+             !r.getDatanodeDetails().isMaintenance() &&

Review Comment:
   As far as I understand, a node doesn’t go offline until its replicas have 
been copied to another node. While the node is ENTERING_MAINTENANCE or 
DECOMMISSIONING, container replicas are added or removed as needed to maintain 
proper replication. The container stays under-replicated until the copies have 
been made and the node has successfully gone offline.
   
   Once that is done, the container is correctly replicated: it has 3 healthy, 
available replicas and 1 offline replica. SCM doesn’t report any 
under-replicated or over-replicated containers, but Recon differs:
   
   - on master, it counts 1 over-replicated container because it sees 4 
replicas (it makes no distinction between online and offline replicas).
   - with this patch, it counts 0.
   
   When the offline datanode is then stopped, SCM still doesn’t report any 
unhealthy containers, and
   
   - on master, Recon no longer counts the 1 over-replicated container.
   - with this patch, nothing changes in Recon.
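
   To make the counting in the scenario above concrete, here is a minimal, self-contained sketch. It does not use the Ozone classes: `ReplicaCountSketch`, `Replica`, `NodeOpState`, `ReplicaState` and both counting methods are hypothetical stand-ins, and the second filter only approximates the patched check by treating DECOMMISSIONED and IN_MAINTENANCE as out of service. A replication factor of 3 is assumed.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-ins for illustration only; these are NOT the Ozone classes.
class ReplicaCountSketch {

  enum NodeOpState { IN_SERVICE, DECOMMISSIONED, IN_MAINTENANCE }

  enum ReplicaState { CLOSED, UNHEALTHY }

  static final class Replica {
    final NodeOpState nodeState;
    final ReplicaState replicaState;

    Replica(NodeOpState nodeState, ReplicaState replicaState) {
      this.nodeState = nodeState;
      this.replicaState = replicaState;
    }
  }

  // Behaviour described for master: only UNHEALTHY replicas are dropped,
  // so replicas on offline nodes still count.
  static long countOnMaster(List<Replica> replicas) {
    return replicas.stream()
        .filter(r -> r.replicaState != ReplicaState.UNHEALTHY)
        .count();
  }

  // Assumed approximation of the patch: also drop replicas whose node is
  // decommissioned or in maintenance.
  static long countWithPatch(List<Replica> replicas) {
    return replicas.stream()
        .filter(r -> r.nodeState != NodeOpState.DECOMMISSIONED
            && r.nodeState != NodeOpState.IN_MAINTENANCE
            && r.replicaState != ReplicaState.UNHEALTHY)
        .count();
  }

  // The state after decommissioning finished: 3 online replicas, 1 offline.
  static List<Replica> exampleAfterDecommission() {
    return Arrays.asList(
        new Replica(NodeOpState.IN_SERVICE, ReplicaState.CLOSED),
        new Replica(NodeOpState.IN_SERVICE, ReplicaState.CLOSED),
        new Replica(NodeOpState.IN_SERVICE, ReplicaState.CLOSED),
        new Replica(NodeOpState.DECOMMISSIONED, ReplicaState.CLOSED));
  }

  public static void main(String[] args) {
    int repFactor = 3; // assumed replication factor
    List<Replica> replicas = exampleAfterDecommission();
    // Excess replicas relative to the replication factor.
    System.out.println("master: excess = "
        + (countOnMaster(replicas) - repFactor));   // prints "master: excess = 1"
    System.out.println("patch:  excess = "
        + (countWithPatch(replicas) - repFactor));  // prints "patch:  excess = 0"
  }
}
```

   Once the offline datanode is stopped, its replica is no longer present in the list, so both counts come out at 3. That matches the observation that the over-replicated count disappears on master and stays unchanged with the patch.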




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
