[
https://issues.apache.org/jira/browse/HDDS-14989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ChenXi updated HDDS-14989:
--------------------------
Description:
h1. Reproduction Steps
# Restart a Follower SCM.
# Transfer leadership to the restarted SCM.
## The SCM should exit safe mode as soon as possible, within one hour.
# Read a key; a NO_REPLICA_FOUND error occurs.
## The key must belong to a container that was closed while the SCM was restarting.
h2. Root Cause
After a Follower SCM restarts, it starts accepting DataNode container reports
before fully catching up with the Ratis log. Containers whose state changed
during the restart remain stale in the Follower's DB.
{{AbstractContainerReportHandler#processContainerReplica}} detects the state
mismatch and calls {{updateContainerState}} via Ratis, which throws
{{NotLeaderException}} on the Follower. This exception propagates and skips the
subsequent {{{}updateContainerReplica{}}}, losing the container's replica
location. When this Follower is promoted to Leader, these containers have
{{{}NO_REPLICA_FOUND{}}}.
h3. Details
{{processContainerReplica}} calls {{updateContainerState}} then
{{updateContainerReplica}} sequentially. Only the Leader can execute
{{updateContainerState}} (a Ratis write); on a Follower, it throws
{{{}NotLeaderException{}}}.
Under normal operation this is harmless: most containers have consistent
state, so {{updateContainerState}} is a no-op and {{updateContainerReplica}}
succeeds. The few {{NotLeaderException}} occurrences seen in
{{IncrementalContainerReportHandler}} logs come from in-flight state
transitions and resolve once the container reaches its final state.
The problem arises on Follower restart: the DataNode protocol server starts
before Ratis log replay completes, so many recently changed containers have
stale state in the DB. Full container report (FCR) processing then triggers a
burst of {{{}NotLeaderException{}}}, each one skipping
{{updateContainerReplica}} for its container. Since replica locations are kept
in memory only, these containers lose all replica info until the next
successful FCR cycle.
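The flow above can be sketched as follows. This is a simplified, hypothetical stand-in for the real SCM types (the actual logic lives in {{AbstractContainerReportHandler}} and {{ContainerManagerImpl}}); it only illustrates how an exception from the state update skips the replica update:

```java
import java.util.HashMap;
import java.util.Map;

public class ReplicaFlowSketch {
    static class NotLeaderException extends RuntimeException { }

    // Replica locations are in-memory only in SCM; empty after a restart.
    static final Map<Long, String> REPLICA_LOCATIONS = new HashMap<>();

    // Stand-in for the Ratis write in updateContainerState:
    // only the Leader may apply it.
    static void updateContainerState(long containerId, boolean isLeader) {
        if (!isLeader) {
            throw new NotLeaderException();
        }
        // Leader path would apply the state transition via Ratis here.
    }

    // Stand-in for updateContainerReplica: records where the replica lives.
    static void updateContainerReplica(long containerId, String datanode) {
        REPLICA_LOCATIONS.put(containerId, datanode);
    }

    // Mirrors processContainerReplica: state update first, replica update second.
    static void processContainerReplica(long containerId, String datanode,
                                        boolean stateMismatch, boolean isLeader) {
        if (stateMismatch) {
            updateContainerState(containerId, isLeader); // throws on a Follower
        }
        updateContainerReplica(containerId, datanode);   // skipped if the call above throws
    }

    public static void main(String[] args) {
        // Consistent state on a Follower: the replica location is recorded.
        processContainerReplica(1L, "dn-1", false, false);
        // Stale state on a Follower: NotLeaderException skips the replica update.
        try {
            processContainerReplica(2L, "dn-2", true, false);
        } catch (NotLeaderException e) {
            // Container 2 now has no known replicas until the next successful FCR.
        }
        System.out.println(REPLICA_LOCATIONS.containsKey(1L)); // true
        System.out.println(REPLICA_LOCATIONS.containsKey(2L)); // false
    }
}
```

With many containers in this stale-state situation at once, a single FCR cycle produces the burst of exceptions seen in the log below.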
h2. Fix
* On startup, a Follower SCM starts the DatanodeProtocolServer (which receives
FCRs and ICRs from DataNodes) only after it has caught up with the leader's
committed log entries.
** Previously, the startup check only guaranteed that the term of the
Follower's log entries matched the leader's; a matching term does not
guarantee that the Follower's log entries are up to date.
* Only allow the Leader SCM to update containers via Ratis by executing
{{updateContainerState}}.
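The first fix item amounts to gating server startup on the applied log index. A minimal sketch, assuming a hypothetical {{CatchUpGate}} helper (names are illustrative, not the real Ozone APIs; the actual change lives in the SCM startup path):

```java
// Hypothetical sketch: the DatanodeProtocolServer is started only once the
// Follower's applied log index reaches the commit index the leader reported
// at startup, rather than merely matching the leader's term.
public class CatchUpGate {
    private final long leaderCommitIndexAtStartup;
    private volatile long appliedIndex = -1;
    private boolean datanodeServerStarted = false;

    CatchUpGate(long leaderCommitIndexAtStartup) {
        this.leaderCommitIndexAtStartup = leaderCommitIndexAtStartup;
    }

    // Called as each Ratis log entry is applied to the state machine.
    void onLogApplied(long index) {
        appliedIndex = index;
        if (!datanodeServerStarted && appliedIndex >= leaderCommitIndexAtStartup) {
            datanodeServerStarted = true; // safe point to start the DN protocol server
        }
    }

    boolean isDatanodeServerStarted() {
        return datanodeServerStarted;
    }

    public static void main(String[] args) {
        CatchUpGate gate = new CatchUpGate(100L);
        gate.onLogApplied(42L);
        System.out.println(gate.isDatanodeServerStarted()); // false: still replaying
        gate.onLogApplied(100L);
        System.out.println(gate.isDatanodeServerStarted()); // true: caught up
    }
}
```

Until the gate opens, container reports are simply not accepted, so stale container states are reconciled from the log before any FCR can trigger {{{}NotLeaderException{}}}.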
h2. SCM LOG
{code:java}
Exception while processing container report for container 17133024 from
datanode 4d624fc8-58ca-44a1-87b5-50964d5a5773(xxxx).
org.apache.hadoop.hdds.scm.exceptions.SCMException:
org.apache.ratis.protocol.exceptions.NotLeaderException: Server
2059f536-5846-4573-b81d-274dc495c727@group-C0BCE64451CF is not the leader,
suggested leader is: 79f352d3-c493-4176-904c-09a3d9ba0bc4|xxx:9894
//...
at
org.apache.hadoop.hdds.scm.container.ContainerManagerImpl.updateContainerState(ContainerManagerImpl.java:302)
at
org.apache.hadoop.hdds.scm.container.AbstractContainerReportHandler.updateContainerState(AbstractContainerReportHandler.java:264)
at
org.apache.hadoop.hdds.scm.container.AbstractContainerReportHandler.processContainerReplica(AbstractContainerReportHandler.java:121)
at
org.apache.hadoop.hdds.scm.container.ContainerReportHandler.processSingleReplica(ContainerReportHandler.java:247)
at
org.apache.hadoop.hdds.scm.container.ContainerReportHandler.onMessage(ContainerReportHandler.java:195)
at
org.apache.hadoop.hdds.scm.container.ContainerReportHandler.onMessage(ContainerReportHandler.java:50)
at
org.apache.hadoop.hdds.server.events.FixedThreadPoolWithAffinityExecutor$ContainerReportProcessTask.run(FixedThreadPoolWithAffinityExecutor.java:282)
at
java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:572)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:317)
at
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
at
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
at java.base/java.lang.Thread.run(Thread.java:1583)
Caused by: org.apache.ratis.protocol.exceptions.NotLeaderException: Server
2059f536-5846-4573-b81d-274dc495c727@group-C0BCE64451CF is not the leader,
suggested leader is: 79f352d3-c493-4176-904c-09a3d9ba0bc4|xxxx:9894
at
org.apache.ratis.server.impl.RaftServerImpl.generateNotLeaderException(RaftServerImpl.java:790)
at
org.apache.ratis.server.impl.RaftServerImpl.checkLeaderState(RaftServerImpl.java:755)
//...
{code}
> Delay follower SCM DN server start until Ratis log catch-up
> -----------------------------------------------------------
>
> Key: HDDS-14989
> URL: https://issues.apache.org/jira/browse/HDDS-14989
> Project: Apache Ozone
> Issue Type: Bug
> Reporter: ChenXi
> Assignee: ChenXi
> Priority: Major
> Labels: pull-request-available
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]