errose28 commented on code in PR #9376:
URL: https://github.com/apache/ozone/pull/9376#discussion_r2615086041
##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/safemode/SCMSafeModeManager.java:
##########
@@ -225,6 +245,63 @@ public double getCurrentContainerThreshold() {
.getCurrentContainerThreshold();
}
+ private synchronized void startSafeModePeriodicLogger() {
+ if (!getInSafeMode() || safeModeLogExecutor != null) {
+ return;
+ }
+ safeModeLogExecutor = Executors.newScheduledThreadPool(1,
+ new ThreadFactoryBuilder()
+ .setNameFormat(scmContext.threadNamePrefix() +
"SCM-SafeMode-Log-%d")
+ .setDaemon(true)
+ .build());
+ safeModeLogExecutor.scheduleAtFixedRate(() -> {
+ try {
+ logSafeModeStatus();
+ } catch (Throwable t) {
+ LOG.warn("Safe mode periodic logger encountered an error", t);
+ }
+ }, 0L, safeModeLogIntervalMs, TimeUnit.MILLISECONDS);
+ LOG.info("Started periodic Safe Mode logging with interval {} ms",
safeModeLogIntervalMs);
+ }
+
+ private void logSafeModeStatus() {
Review Comment:
Can we make this whole method synchronized instead of trying to isolate the
specific parts of it that need to be coordinated with other method calls? This
method will run quickly and not block other safemode checks. It also makes it
easier to reason about. Right now there could be some strange cases where, for
example, we read `safeModeStatus` as "in safemode", but it has left safemode by
the time we get to the logging section, in which case the rule states won't
match the status.
##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/safemode/SCMSafeModeManager.java:
##########
@@ -225,6 +245,63 @@ public double getCurrentContainerThreshold() {
.getCurrentContainerThreshold();
}
+ private synchronized void startSafeModePeriodicLogger() {
+ if (!getInSafeMode() || safeModeLogExecutor != null) {
+ return;
+ }
+ safeModeLogExecutor = Executors.newScheduledThreadPool(1,
+ new ThreadFactoryBuilder()
+ .setNameFormat(scmContext.threadNamePrefix() +
"SCM-SafeMode-Log-%d")
+ .setDaemon(true)
+ .build());
+ safeModeLogExecutor.scheduleAtFixedRate(() -> {
+ try {
+ logSafeModeStatus();
+ } catch (Throwable t) {
+ LOG.warn("Safe mode periodic logger encountered an error", t);
+ }
+ }, 0L, safeModeLogIntervalMs, TimeUnit.MILLISECONDS);
+ LOG.info("Started periodic Safe Mode logging with interval {} ms",
safeModeLogIntervalMs);
+ }
+
+ private void logSafeModeStatus() {
+ if (!getInSafeMode()) {
+ stopSafeModePeriodicLogger();
+ return;
+ }
+ SafeModeStatus safeModeStatus = status.get();
+ int validatedCount;
+ int preCheckValidatedCount;
+ synchronized (this) {
+ validatedCount = validatedRules.size();
+ preCheckValidatedCount = validatedPreCheckRules.size();
+ }
+ String rules = exitRules.values().stream()
+ .map(rule -> {
+ String name = rule.getRuleName();
+ boolean isValidated;
+ synchronized (this) {
+ isValidated = validatedRules.contains(name);
+ }
+ return name + "(status=" + (isValidated ? "validated" : "waiting")
+ + ", " + rule.getStatusText() + ")";
+ })
+ .collect(Collectors.joining(", "));
+ LOG.info(
Review Comment:
I'm thinking of something like this for the log output. This has one summary
line and each rule on its own line. It can be logged as one single log message
using `\n` to separate the lines so it is not interrupted. The common prefix
still helps when searching the logs with `grep`. Note that some of the safemode
rules' messages have a semicolon at the end for some reason which we can
probably remove. Lines are usually parsed with `awk` so using a pseudo-json
layout with parentheses and brackets doesn't provide much benefit.
```
SCM SafeMode Status | state=INITIAL preCheckComplete=false
validatedPreCheckRules=0/1 validatedRules=2/5
SCM SafeMode Status | DataNodeSafeModeRule (waiting) registered datanodes
(=0) >= required datanodes (=5)
SCM SafeMode Status | HealthyPipelineSafeModeRule (waiting) healthy
Ratis/THREE pipelines (=0) >= healthyPipelineThresholdCount (=1)
SCM SafeMode Status | OneReplicaPipelineSafeModeRule (waiting) reported
Ratis/THREE pipelines with at least one datanode (=0) >= threshold (=0)
SCM SafeMode Status | RatisContainerSafeModeRule (validated) 100.00% of
[RATIS] Containers(0 / 0) with at least N reported replica (=1.00) >=
safeModeCutoff (=0.99)
SCM SafeMode Status | ECContainerSafeModeRule (validated) 100.00% of [EC]
Containers(0 / 0) with at least N reported replica (=1.00) >= safeModeCutoff
(=0.99)
```
##########
hadoop-hdds/common/src/main/java/org/apache/hadoop/hdds/HddsConfigKeys.java:
##########
@@ -112,6 +112,10 @@ public final class HddsConfigKeys {
public static final double
HDDS_SCM_SAFEMODE_ONE_NODE_REPORTED_PIPELINE_PCT_DEFAULT = 0.90;
+ public static final String HDDS_SCM_SAFEMODE_LOG_INTERVAL =
+ "hdds.scm.safemode.log.interval";
Review Comment:
Can we make this dynamically reconfigurable? I'm thinking of a scenario
where one safemode rule is having trouble validating and we want to adjust this
rather than stop the SCM and restart the whole safemode process.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]