Re: [PR] HDDS-14989. Delay follower SCM DN server start until Ratis log catch-up [ozone]

via GitHub Fri, 15 May 2026 02:52:45 -0700


sumitagrawl commented on code in PR #10059:
URL: https://github.com/apache/ozone/pull/10059#discussion_r3247254446



##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/server/StorageContainerManager.java:
##########
@@ -1580,8 +1573,10 @@ public void start() throws IOException {
     }
     getBlockProtocolServer().start();
 
-    // start datanode protocol server
-    getDatanodeProtocolServer().start();
+    // If HA is enabled, start datanode protocol server once leader is ready.
+    if (!scmStorageConfig.isSCMHAEnabled()) {

Review Comment:
   This guard is not required: deferring the datanode protocol server start 
delays DN registration, which in turn delays safe mode exit on the follower.



##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/ha/SCMStateMachine.java:
##########
@@ -401,19 +366,100 @@ public void notifyTermIndexUpdated(long term, long index) {
     if (transactionBuffer != null) {
       transactionBuffer.updateLatestTrxInfo(TransactionInfo.valueOf(term, index));
     }
+  }
 
-    if (currentLeaderTerm.get() == term) {
-      // This means after a restart, all pending transactions have been applied.
+  public boolean getIsStateMachineReady() {
+    return isStateMachineReady.get();
+  }
+
+  /**
+   * Start the DN protocol server and trigger safe mode re-evaluation.
+   *
+   * <p>In HA mode the DN server is deliberately not started during
+   * {@link org.apache.hadoop.hdds.scm.server.StorageContainerManager#start()}.
+   * Instead it is deferred until the SCM state machine has caught up with
+   * the leader's committed log entries, so that DN heartbeats are processed
+   * against the latest container/pipeline state rather than a stale snapshot.
+   *
+   * <p>The method is guarded by {@code isStateMachineReady} (CAS) to ensure
+   * the non-idempotent {@code DatanodeProtocolServer.start()} is invoked
+   * exactly once.
+   */
+  private void tryStartDNServerAndRefreshSafeMode() {
+    if (isStateMachineReady.get()) {
+      return;
+    }
+    if (scm.getScmContext().isLeader() || isFollowerCaughtUp()) {
       if (isStateMachineReady.compareAndSet(false, true)) {
-        // Refresh Safemode rules state if not already done.
+        scm.getDatanodeProtocolServer().start();
         scm.getScmSafeModeManager().refreshAndValidate();
       }
-      currentLeaderTerm.set(-1L);
     }
   }
 
-  public boolean getIsStateMachineReady() {
-    return isStateMachineReady.get();
+  /**
+   * Check whether this follower's state machine has caught up with the
+   * leader's committed log entries.
+   * @return true if {@code lastAppliedIndex >= leaderCommitIndex}
+   */
+  private boolean isFollowerCaughtUp() {

Review Comment:
   We can avoid most of these StateMachine changes, since they will not resolve 
the problem completely: sync from the leader may not be immediate, and the 
leader keeps trying to sync and update.
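
To illustrate the concern, here is a minimal standalone sketch of the guard described in the Javadoc above. It is not Ozone code: `DeferredStartSketch`, `tryStart`, and `startCalls` are hypothetical names, and the two index arguments stand in for whatever values the follower would read from Ratis. It assumes the catch-up test is the `lastAppliedIndex >= leaderCommitIndex` comparison from the Javadoc.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for the deferred DN server start; not Ozone code. */
class DeferredStartSketch {
  // One-shot flag, playing the role of isStateMachineReady in the diff.
  private final AtomicBoolean ready = new AtomicBoolean(false);
  // Counts how often the (non-idempotent) start action actually ran.
  private final AtomicInteger startCalls = new AtomicInteger();

  /** Point-in-time comparison; the leader's commit index may move afterwards. */
  static boolean isCaughtUp(long lastAppliedIndex, long leaderCommitIndex) {
    return lastAppliedIndex >= leaderCommitIndex;
  }

  /** Returns true only for the single call that performs the start. */
  boolean tryStart(long lastAppliedIndex, long leaderCommitIndex) {
    if (!isCaughtUp(lastAppliedIndex, leaderCommitIndex)) {
      return false;
    }
    // compareAndSet lets exactly one caller win, even under concurrency.
    if (ready.compareAndSet(false, true)) {
      startCalls.incrementAndGet(); // stands in for the real start() call
      return true;
    }
    return false;
  }

  boolean isReady() {
    return ready.get();
  }

  int startCalls() {
    return startCalls.get();
  }

  public static void main(String[] args) {
    DeferredStartSketch sketch = new DeferredStartSketch();
    System.out.println("behind:     " + sketch.tryStart(90, 100));
    System.out.println("caught up:  " + sketch.tryStart(100, 100));
    // The leader has committed more entries, but the flag stays set.
    System.out.println("lags again: " + sketch.tryStart(100, 120));
    System.out.println("ready=" + sketch.isReady()
        + " startCalls=" + sketch.startCalls());
  }
}
```

In this sketch the compare-and-set does give exactly-once start, but the comparison is only a snapshot: once the flag is set, the follower can fall behind again without the flag changing, which is one reading of why the catch-up check may not resolve the problem completely.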



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

