[
https://issues.apache.org/jira/browse/HDDS-14989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ChenXi updated HDDS-14989:
--------------------------
Description:
# Restart the SCM Follower.
# Transfer leadership to the restarted SCM.
## The restarted SCM should exit safe mode and take leadership as soon as possible, within one hour of the restart.
# Read a key; the read fails with a NO_REPLICA_FOUND error.
## The container holding the key was closed while the SCM was restarting.
```
Exception while processing container report for container 17133024 from
datanode 4d624fc8-58ca-44a1-87b5-50964d5a5773(xxxx).
org.apache.hadoop.hdds.scm.exceptions.SCMException:
org.apache.ratis.protocol.exceptions.NotLeaderException: Server
2059f536-5846-4573-b81d-274dc495c727@group-C0BCE64451CF is not the leader,
suggested leader is: 79f352d3-c493-4176-904c-09a3d9ba0bc4|xxx:9894//...
    at org.apache.hadoop.hdds.scm.container.ContainerManagerImpl.updateContainerState(ContainerManagerImpl.java:302)
    at org.apache.hadoop.hdds.scm.container.AbstractContainerReportHandler.updateContainerState(AbstractContainerReportHandler.java:264)
    at org.apache.hadoop.hdds.scm.container.AbstractContainerReportHandler.processContainerReplica(AbstractContainerReportHandler.java:121)
    at org.apache.hadoop.hdds.scm.container.ContainerReportHandler.processSingleReplica(ContainerReportHandler.java:247)
    at org.apache.hadoop.hdds.scm.container.ContainerReportHandler.onMessage(ContainerReportHandler.java:195)
    at org.apache.hadoop.hdds.scm.container.ContainerReportHandler.onMessage(ContainerReportHandler.java:50)
    at org.apache.hadoop.hdds.server.events.FixedThreadPoolWithAffinityExecutor$ContainerReportProcessTask.run(FixedThreadPoolWithAffinityExecutor.java:282)
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:572)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:317)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
    at java.base/java.lang.Thread.run(Thread.java:1583)
Caused by: org.apache.ratis.protocol.exceptions.NotLeaderException: Server
2059f536-5846-4573-b81d-274dc495c727@group-C0BCE64451CF is not the leader,
suggested leader is: 79f352d3-c493-4176-904c-09a3d9ba0bc4|xxxx:9894
    at org.apache.ratis.server.impl.RaftServerImpl.generateNotLeaderException(RaftServerImpl.java:790)
    at org.apache.ratis.server.impl.RaftServerImpl.checkLeaderState(RaftServerImpl.java:755)
//...
```
h2. Root Cause
After a Follower SCM restarts, it starts accepting DataNode container reports
before it has fully caught up with the Ratis log, so containers whose state
changed during the restart remain stale in the Follower's DB.
{{AbstractContainerReportHandler#processContainerReplica}} detects the state
mismatch and calls {{updateContainerState}} via Ratis, which throws
{{NotLeaderException}} on the Follower. The exception propagates and skips the
subsequent {{{}updateContainerReplica{}}}, so the container's replica locations
are never recorded. When this Follower is later promoted to Leader, reads of
these containers fail with {{{}NO_REPLICA_FOUND{}}}.
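The failure path can be sketched with a minimal, self-contained model (all class and field names below are hypothetical simplifications for illustration, not the real Ozone APIs):

```java
import java.util.*;

// Simplified model of AbstractContainerReportHandler#processContainerReplica:
// the state update is a Ratis write that only the leader may perform, and the
// exception it throws on a follower also skips the in-memory replica update.
public class ReplicaLossSketch {
    static class NotLeaderException extends RuntimeException {}

    final boolean isLeader;
    final Map<Long, String> containerStates = new HashMap<>();       // replicated via Ratis
    final Map<Long, Set<String>> replicaLocations = new HashMap<>(); // in-memory only

    ReplicaLossSketch(boolean isLeader) { this.isLeader = isLeader; }

    void updateContainerState(long id, String newState) {
        if (!isLeader) {
            throw new NotLeaderException(); // Ratis write rejected on a follower
        }
        containerStates.put(id, newState);
    }

    void updateContainerReplica(long id, String datanode) {
        replicaLocations.computeIfAbsent(id, k -> new HashSet<>()).add(datanode);
    }

    // Mirrors the real handler: state update first, replica update second.
    void processContainerReplica(long id, String reportedState, String datanode) {
        if (!reportedState.equals(containerStates.get(id))) {
            updateContainerState(id, reportedState); // throws on follower -> next line never runs
        }
        updateContainerReplica(id, datanode);
    }

    public static void main(String[] args) {
        ReplicaLossSketch follower = new ReplicaLossSketch(false);
        follower.containerStates.put(17133024L, "OPEN"); // stale: the DN now reports CLOSED
        try {
            follower.processContainerReplica(17133024L, "CLOSED", "dn-1");
        } catch (NotLeaderException e) {
            // logged and swallowed by the report handler in the real code path
        }
        // The replica location was never recorded: NO_REPLICA_FOUND after promotion.
        System.out.println(follower.replicaLocations.containsKey(17133024L)); // prints "false"
    }
}
```

Containers whose stale state happens to match the report never enter the throwing branch, which is why only recently-changed containers lose their replicas.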
h3. Details
{{processContainerReplica}} calls {{updateContainerState}} then
{{updateContainerReplica}} sequentially. Only the Leader can execute
{{updateContainerState}} (a Ratis write); on a Follower, it throws
{{{}NotLeaderException{}}}.
Under normal operation this is harmless — most containers have consistent
state, so {{updateContainerState}} is a no-op and {{updateContainerReplica}}
succeeds. The few {{NotLeaderException}}s seen in
{{IncrementalContainerReportHandler}} logs come from in-flight state
transitions and resolve once the container reaches its final state.
The problem arises on Follower restart: the DN protocol server starts before
Ratis log replay completes, so many recently-changed containers have stale
state in the DB. Full container report (FCR) processing then triggers a burst
of {{{}NotLeaderException{}}}, each one skipping {{updateContainerReplica}} for
its container. Since replica locations are kept in memory only, these
containers lose all replica info until the next successful FCR cycle.
h3. Fix
* On SCM startup, start the DatanodeProtocolServer (which receives reports from the Datanodes) only after the Follower has caught up to the leader's committed log entries.
** Previously it was only guaranteed that the term of the Follower's log entries matched the leader's; matching terms alone do not guarantee that the Follower's log entries are up to date.
* Only allow the leader SCM to update container state via Ratis by executing {{updateContainerState}}.
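The first bullet amounts to a catch-up gate before server startup. A minimal sketch (helper and interface names here are hypothetical; the real change lives in SCM's startup sequence and Ratis integration):

```java
// Block DatanodeProtocolServer startup until this node's last *applied* Ratis
// index reaches the leader's *committed* index. Comparing terms alone is not
// enough: a follower can share the leader's term while still missing entries.
public class CatchUpGate {
    interface RatisView {
        long leaderCommitIndex(); // committed index advertised by the leader
        long lastAppliedIndex();  // entries this node has applied to its state machine
    }

    // Poll until applied >= leader commit, or give up at the deadline.
    static boolean awaitCatchUp(RatisView ratis, long timeoutMillis, long pollMillis)
            throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (System.currentTimeMillis() < deadline) {
            if (ratis.lastAppliedIndex() >= ratis.leaderCommitIndex()) {
                return true; // safe to start the DatanodeProtocolServer now
            }
            Thread.sleep(pollMillis);
        }
        return false;
    }

    public static void main(String[] args) throws InterruptedException {
        // Toy view: the node applies one more entry per poll; leader committed index is 5.
        long[] applied = {0};
        RatisView view = new RatisView() {
            public long leaderCommitIndex() { return 5; }
            public long lastAppliedIndex() { return applied[0]++; }
        };
        System.out.println(awaitCatchUp(view, 1000, 1)); // prints "true"
        // Only after this returns true would the DN protocol server be started.
    }
}
```

Reports that arrive only after this gate opens are evaluated against up-to-date container state, so the {{NotLeaderException}} burst never happens.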
> Delay follower SCM DN server start until Ratis log catch-up
> -----------------------------------------------------------
>
> Key: HDDS-14989
> URL: https://issues.apache.org/jira/browse/HDDS-14989
> Project: Apache Ozone
> Issue Type: Bug
> Reporter: ChenXi
> Assignee: ChenXi
> Priority: Major
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]