aryangupta1998 commented on code in PR #9651:
URL: https://github.com/apache/ozone/pull/9651#discussion_r2791735934
##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/safemode/HealthyPipelineSafeModeRule.java:
##########
@@ -123,81 +117,62 @@ protected synchronized boolean validate() {
LOG.info("All SCM pipelines are closed due to ongoing upgrade " +
"finalization. Bypassing healthy pipeline safemode rule.");
return true;
+ }
+ // Query PipelineManager directly for healthy pipeline count
+ List<Pipeline> openPipelines = pipelineManager.getPipelines(
+ RatisReplicationConfig.getInstance(HddsProtos.ReplicationFactor.THREE),
+ Pipeline.PipelineState.OPEN);
+
+ LOG.info("Found {} open RATIS/THREE pipelines", openPipelines.size());
+ currentHealthyPipelineCount = (int) openPipelines.stream()
+ .filter(this::isPipelineHealthy)
Review Comment:
We compute the threshold once in `initializeRule()` using the number of open
pipelines right after SCM startup(SCM marks pipeline as OPEN if it's report
from DN suggesting it as healthy). At that time, the open‑pipeline count is
likely small, so the threshold is low and fixed.
Later, `validate()` runs many times as DNs send pipeline reports, and the
number of open pipelines (and healthy pipelines) can increase. That means
currentHealthyPipelineCount can easily exceed the old low threshold, even if
some DNs/pipelines are unhealthy.
To avoid this, the threshold should be recomputed inside validate() using
the same snapshot of open pipelines.
Also, please write a unit test which first takes the threshold as a small
number(initialized in `initilazeRule()`), then after DN sends some reports to
SCM, we increase the threshold based on more pipelines reported, and then test
the SCM to come out of HealthyPipelineSafeModeRule.
```suggestion
List<Pipeline> openPipelines = pipelineManager.getPipelines(
RatisReplicationConfig.getInstance(HddsProtos.ReplicationFactor.THREE),
Pipeline.PipelineState.OPEN);
LOG.info("Found {} open RATIS/THREE pipelines", openPipelines.size());
currentHealthyPipelineCount = (int) openPipelines.stream()
.filter(this::isPipelineHealthy)
.count();
int pipelineCount = openPipelines.size();
healthyPipelineThresholdCount = Math.max(minHealthyPipelines,
(int) Math.ceil(healthyPipelinesPercent * pipelineCount));
currentHealthyPipelineCount = (int) openPipelines.stream()
.filter(this::isPipelineHealthy)
.count();
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]