aryangupta1998 commented on code in PR #9651:
URL: https://github.com/apache/ozone/pull/9651#discussion_r2791735934


##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/safemode/HealthyPipelineSafeModeRule.java:
##########
@@ -123,81 +117,62 @@ protected synchronized boolean validate() {
       LOG.info("All SCM pipelines are closed due to ongoing upgrade " +
           "finalization. Bypassing healthy pipeline safemode rule.");
       return true;
+    }
+    // Query PipelineManager directly for healthy pipeline count
+    List<Pipeline> openPipelines = pipelineManager.getPipelines(
+        RatisReplicationConfig.getInstance(HddsProtos.ReplicationFactor.THREE),
+        Pipeline.PipelineState.OPEN);
+    
+    LOG.info("Found {} open RATIS/THREE pipelines", openPipelines.size());
+    currentHealthyPipelineCount = (int) openPipelines.stream()
+        .filter(this::isPipelineHealthy)

Review Comment:
   We compute the threshold once in `initializeRule()` using the number of open 
pipelines right after SCM startup(SCM marks pipeline as OPEN if it's report 
from DN suggesting it as healthy). At that time, the open‑pipeline count is 
likely small, so the threshold is low and fixed.
   Later, `validate()` runs many times as DNs send pipeline reports, and the 
number of open pipelines (and healthy pipelines) can increase. That means 
currentHealthyPipelineCount can easily exceed the old low threshold, even if 
some DNs/pipelines are unhealthy.
   To avoid this, the threshold should be recomputed inside validate() using 
the same snapshot of open pipelines.
   
   Also, please write a unit test which first takes the threshold as a small 
number(initialized in `initilazeRule()`), then after DN sends some reports to 
SCM, we increase the threshold based on more pipelines reported, and then test 
the SCM to come out of HealthyPipelineSafeModeRule.
   
   ```suggestion
       List<Pipeline> openPipelines = pipelineManager.getPipelines(
           
RatisReplicationConfig.getInstance(HddsProtos.ReplicationFactor.THREE),
           Pipeline.PipelineState.OPEN);
   
       LOG.info("Found {} open RATIS/THREE pipelines", openPipelines.size());
       currentHealthyPipelineCount = (int) openPipelines.stream()
           .filter(this::isPipelineHealthy)
           .count();
       int pipelineCount = openPipelines.size();
       healthyPipelineThresholdCount = Math.max(minHealthyPipelines,
           (int) Math.ceil(healthyPipelinesPercent * pipelineCount));
   
       currentHealthyPipelineCount = (int) openPipelines.stream()
           .filter(this::isPipelineHealthy)
           .count();
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to