errose28 commented on code in PR #9376:
URL: https://github.com/apache/ozone/pull/9376#discussion_r2615086041


##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/safemode/SCMSafeModeManager.java:
##########
@@ -225,6 +245,63 @@ public double getCurrentContainerThreshold() {
         .getCurrentContainerThreshold();
   }
 
+  private synchronized void startSafeModePeriodicLogger() {
+    if (!getInSafeMode() || safeModeLogExecutor != null) {
+      return;
+    }
+    safeModeLogExecutor = Executors.newScheduledThreadPool(1,
+        new ThreadFactoryBuilder()
+            .setNameFormat(scmContext.threadNamePrefix() + 
"SCM-SafeMode-Log-%d")
+            .setDaemon(true)
+            .build());
+    safeModeLogExecutor.scheduleAtFixedRate(() -> {
+      try {
+        logSafeModeStatus();
+      } catch (Throwable t) {
+        LOG.warn("Safe mode periodic logger encountered an error", t);
+      }
+    }, 0L, safeModeLogIntervalMs, TimeUnit.MILLISECONDS);
+    LOG.info("Started periodic Safe Mode logging with interval {} ms", 
safeModeLogIntervalMs);
+  }
+
+  private void logSafeModeStatus() {

Review Comment:
   Can we make this whole method synchronized instead of trying to isolate the 
specific parts of it that need to be coordinated with other method calls? This 
method will run quickly and not block other safemode checks. It also makes it 
easier to reason about. Right now there could be some strange cases where, for 
example, we read `safeModeStatus` as "in safemode", but it has left safemode by 
the time we get to the logging section, in which case the rule states won't 
match the status.



##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/safemode/SCMSafeModeManager.java:
##########
@@ -225,6 +245,63 @@ public double getCurrentContainerThreshold() {
         .getCurrentContainerThreshold();
   }
 
+  private synchronized void startSafeModePeriodicLogger() {
+    if (!getInSafeMode() || safeModeLogExecutor != null) {
+      return;
+    }
+    safeModeLogExecutor = Executors.newScheduledThreadPool(1,
+        new ThreadFactoryBuilder()
+            .setNameFormat(scmContext.threadNamePrefix() + 
"SCM-SafeMode-Log-%d")
+            .setDaemon(true)
+            .build());
+    safeModeLogExecutor.scheduleAtFixedRate(() -> {
+      try {
+        logSafeModeStatus();
+      } catch (Throwable t) {
+        LOG.warn("Safe mode periodic logger encountered an error", t);
+      }
+    }, 0L, safeModeLogIntervalMs, TimeUnit.MILLISECONDS);
+    LOG.info("Started periodic Safe Mode logging with interval {} ms", 
safeModeLogIntervalMs);
+  }
+
+  private void logSafeModeStatus() {
+    if (!getInSafeMode()) {
+      stopSafeModePeriodicLogger();
+      return;
+    }
+    SafeModeStatus safeModeStatus = status.get();
+    int validatedCount;
+    int preCheckValidatedCount;
+    synchronized (this) {
+      validatedCount = validatedRules.size();
+      preCheckValidatedCount = validatedPreCheckRules.size();
+    }
+    String rules = exitRules.values().stream()
+        .map(rule -> {
+          String name = rule.getRuleName();
+          boolean isValidated;
+          synchronized (this) {
+            isValidated = validatedRules.contains(name);
+          }
+          return name + "(status=" + (isValidated ? "validated" : "waiting")
+              + ", " + rule.getStatusText() + ")";
+        })
+        .collect(Collectors.joining(", "));
+    LOG.info(

Review Comment:
   I'm thinking of something like this for the log output. This has one summary 
line and each rule on its own line. It can be logged as one single log message 
using `\n` to separate the lines so it is not interrupted. The common prefix 
still helps when searching the logs with `grep`. Note that some of the safemode 
rules' messages have a semicolon at the end for some reason which we can 
probably remove. Lines are usually parsed with `awk` so using a pseudo-json 
layout with parentheses and brackets doesn't provide much benefit.
   
   ```
   SCM SafeMode Status | state=INITIAL preCheckComplete=false 
validatedPreCheckRules=0/1 validatedRules=2/5
   SCM SafeMode Status | DataNodeSafeModeRule (waiting) registered datanodes 
(=0) >= required datanodes (=5)
   SCM SafeMode Status | HealthyPipelineSafeModeRule (waiting) healthy 
Ratis/THREE pipelines (=0) >= healthyPipelineThresholdCount (=1)
   SCM SafeMode Status | OneReplicaPipelineSafeModeRule (waiting) reported 
Ratis/THREE pipelines with at least one datanode (=0) >= threshold (=0)
   SCM SafeMode Status | RatisContainerSafeModeRule (validated) 100.00% of 
[RATIS] Containers(0 / 0) with at least N reported replica (=1.00) >= 
safeModeCutoff (=0.99)
   SCM SafeMode Status | ECContainerSafeModeRule (validated) 100.00% of [EC] 
Containers(0 / 0) with at least N reported replica (=1.00) >= safeModeCutoff 
(=0.99)
   ```



##########
hadoop-hdds/common/src/main/java/org/apache/hadoop/hdds/HddsConfigKeys.java:
##########
@@ -112,6 +112,10 @@ public final class HddsConfigKeys {
   public static final double
       HDDS_SCM_SAFEMODE_ONE_NODE_REPORTED_PIPELINE_PCT_DEFAULT = 0.90;
 
+  public static final String HDDS_SCM_SAFEMODE_LOG_INTERVAL =
+      "hdds.scm.safemode.log.interval";

Review Comment:
   Can we make this dynamically reconfigurable? I'm thinking of a scenario 
where one safemode rule is having trouble validating and we want to adjust this 
rather than stop the SCM and restart the whole safemode process.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to