[
https://issues.apache.org/jira/browse/HDDS-14989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ChenXi updated HDDS-14989:
--------------------------
Description:
h1. Reproduction Steps
# Restart a Follower SCM.
# Transfer leadership to the restarted SCM.
## The SCM should exit safe mode as soon as possible, within one hour.
# Read a key; a NO_REPLICA_FOUND error occurs.
## The key must belong to a container that was closed while the SCM was restarting.
h2. Root Cause
After a Follower SCM restarts, it starts accepting DataNode container reports
before fully catching up with the Ratis log. Containers whose state changed
during the restart remain stale in the Follower's DB.
{{AbstractContainerReportHandler#processContainerReplica}} detects the state
mismatch and calls {{updateContainerState}} via Ratis, which throws
{{NotLeaderException}} on the Follower. This exception propagates and skips the
subsequent {{{}updateContainerReplica{}}}, losing the container's replica
location. When this Follower is promoted to Leader, these containers have
{{{}NO_REPLICA_FOUND{}}}.
h3. Details
{{processContainerReplica}} calls {{updateContainerState}} then
{{updateContainerReplica}} sequentially. Only the Leader can execute
{{updateContainerState}} (a Ratis write); on a Follower, it throws
{{{}NotLeaderException{}}}.
Under normal operation this is harmless: most containers have consistent
state, so {{updateContainerState}} is a no-op and {{updateContainerReplica}}
succeeds. The few {{NotLeaderException}} occurrences seen in
{{IncrementalContainerReportHandler}} logs come from in-flight state
transitions and resolve once the container reaches its final state.
The problem arises on Follower restart: the DataNode protocol server starts
before Ratis log replay completes, so many recently changed containers have
stale state in the DB. Full container report (FCR) processing then triggers a
burst of {{{}NotLeaderException{}}}, each one skipping
{{updateContainerReplica}} for its container. Since replica locations are kept
in memory only, these containers lose all replica info until the next
successful FCR cycle.
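The flow above can be sketched as follows. This is a simplified, hypothetical stand-in for the real SCM types (the actual logic lives in {{AbstractContainerReportHandler}} and {{ContainerManagerImpl}}); it only illustrates how an exception from the state update skips the replica update:

```java
import java.util.HashMap;
import java.util.Map;

public class ReplicaFlowSketch {
    static class NotLeaderException extends RuntimeException { }

    // Replica locations are in-memory only in SCM; empty after a restart.
    static final Map<Long, String> REPLICA_LOCATIONS = new HashMap<>();

    // Stand-in for the Ratis write in updateContainerState:
    // only the Leader may apply it.
    static void updateContainerState(long containerId, boolean isLeader) {
        if (!isLeader) {
            throw new NotLeaderException();
        }
        // Leader path would apply the state transition via Ratis here.
    }

    // Stand-in for updateContainerReplica: records where the replica lives.
    static void updateContainerReplica(long containerId, String datanode) {
        REPLICA_LOCATIONS.put(containerId, datanode);
    }

    // Mirrors processContainerReplica: state update first, replica update second.
    static void processContainerReplica(long containerId, String datanode,
                                        boolean stateMismatch, boolean isLeader) {
        if (stateMismatch) {
            updateContainerState(containerId, isLeader); // throws on a Follower
        }
        updateContainerReplica(containerId, datanode);   // skipped if the call above throws
    }

    public static void main(String[] args) {
        // Consistent state on a Follower: the replica location is recorded.
        processContainerReplica(1L, "dn-1", false, false);
        // Stale state on a Follower: NotLeaderException skips the replica update.
        try {
            processContainerReplica(2L, "dn-2", true, false);
        } catch (NotLeaderException e) {
            // Container 2 now has no known replicas until the next successful FCR.
        }
        System.out.println(REPLICA_LOCATIONS.containsKey(1L)); // true
        System.out.println(REPLICA_LOCATIONS.containsKey(2L)); // false
    }
}
```

With many containers in this stale-state situation at once, a single FCR cycle produces the burst of exceptions seen in the log below.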
h2. Fix
* On startup, a Follower SCM starts the DatanodeProtocolServer (which receives
FCRs and ICRs from DataNodes) only after it has caught up with the leader's
committed log entries.
** Previously, the startup check only guaranteed that the term of the
Follower's log entries matched the leader's; a matching term does not
guarantee that the Follower's log entries are up to date.
* Only allow the Leader SCM to update containers via Ratis by executing
{{updateContainerState}}.
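The first fix item amounts to gating server startup on the applied log index. A minimal sketch, assuming a hypothetical {{CatchUpGate}} helper (names are illustrative, not the real Ozone APIs; the actual change lives in the SCM startup path):

```java
// Hypothetical sketch: the DatanodeProtocolServer is started only once the
// Follower's applied log index reaches the commit index the leader reported
// at startup, rather than merely matching the leader's term.
public class CatchUpGate {
    private final long leaderCommitIndexAtStartup;
    private volatile long appliedIndex = -1;
    private boolean datanodeServerStarted = false;

    CatchUpGate(long leaderCommitIndexAtStartup) {
        this.leaderCommitIndexAtStartup = leaderCommitIndexAtStartup;
    }

    // Called as each Ratis log entry is applied to the state machine.
    void onLogApplied(long index) {
        appliedIndex = index;
        if (!datanodeServerStarted && appliedIndex >= leaderCommitIndexAtStartup) {
            datanodeServerStarted = true; // safe point to start the DN protocol server
        }
    }

    boolean isDatanodeServerStarted() {
        return datanodeServerStarted;
    }

    public static void main(String[] args) {
        CatchUpGate gate = new CatchUpGate(100L);
        gate.onLogApplied(42L);
        System.out.println(gate.isDatanodeServerStarted()); // false: still replaying
        gate.onLogApplied(100L);
        System.out.println(gate.isDatanodeServerStarted()); // true: caught up
    }
}
```

Until the gate opens, container reports are simply not accepted, so stale container states are reconciled from the log before any FCR can trigger {{{}NotLeaderException{}}}.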
h2. SCM LOG
{code:java}
Exception while processing container report for container 17133024 from
datanode 4d624fc8-58ca-44a1-87b5-50964d5a5773(xxxx).
org.apache.hadoop.hdds.scm.exceptions.SCMException:
org.apache.ratis.protocol.exceptions.NotLeaderException: Server
2059f536-5846-4573-b81d-274dc495c727@group-C0BCE64451CF is not the leader,
suggested leader is: 79f352d3-c493-4176-904c-09a3d9ba0bc4|xxx:9894
//...
at
org.apache.hadoop.hdds.scm.container.ContainerManagerImpl.updateContainerState(ContainerManagerImpl.java:302)
at
org.apache.hadoop.hdds.scm.container.AbstractContainerReportHandler.updateContainerState(AbstractContainerReportHandler.java:264)
at
org.apache.hadoop.hdds.scm.container.AbstractContainerReportHandler.processContainerReplica(AbstractContainerReportHandler.java:121)
at
org.apache.hadoop.hdds.scm.container.ContainerReportHandler.processSingleReplica(ContainerReportHandler.java:247)
at
org.apache.hadoop.hdds.scm.container.ContainerReportHandler.onMessage(ContainerReportHandler.java:195)
at
org.apache.hadoop.hdds.scm.container.ContainerReportHandler.onMessage(ContainerReportHandler.java:50)
at
org.apache.hadoop.hdds.server.events.FixedThreadPoolWithAffinityExecutor$ContainerReportProcessTask.run(FixedThreadPoolWithAffinityExecutor.java:282)
at
java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:572)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:317)
at
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
at
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
at java.base/java.lang.Thread.run(Thread.java:1583)
Caused by: org.apache.ratis.protocol.exceptions.NotLeaderException: Server
2059f536-5846-4573-b81d-274dc495c727@group-C0BCE64451CF is not the leader,
suggested leader is: 79f352d3-c493-4176-904c-09a3d9ba0bc4|xxxx:9894
at
org.apache.ratis.server.impl.RaftServerImpl.generateNotLeaderException(RaftServerImpl.java:790)
at
org.apache.ratis.server.impl.RaftServerImpl.checkLeaderState(RaftServerImpl.java:755)
//...
{code}
> Delay follower SCM DN server start until Ratis log catch-up
> -----------------------------------------------------------
>
> Key: HDDS-14989
> URL: https://issues.apache.org/jira/browse/HDDS-14989
> Project: Apache Ozone
> Issue Type: Bug
> Reporter: ChenXi
> Assignee: ChenXi
> Priority: Major
> Labels: pull-request-available
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]