[
https://issues.apache.org/jira/browse/HDDS-14989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ChenXi updated HDDS-14989:
--------------------------
Description:
# Restart the SCM Follower.
# Transfer leadership to the restarted SCM.
## The restarted SCM should exit safe mode and take leadership as soon as possible, within one hour of the restart.
# Read a key; the read fails with a NO_REPLICA_FOUND error.
## The container holding the key was closed while the SCM was restarting.
```
Exception while processing container report for container 17133024 from
datanode 4d624fc8-58ca-44a1-87b5-50964d5a5773(xxxx).
org.apache.hadoop.hdds.scm.exceptions.SCMException:
org.apache.ratis.protocol.exceptions.NotLeaderException: Server
2059f536-5846-4573-b81d-274dc495c727@group-C0BCE64451CF is not the leader,
suggested leader is: 79f352d3-c493-4176-904c-09a3d9ba0bc4|xxx:9894//...
    at org.apache.hadoop.hdds.scm.container.ContainerManagerImpl.updateContainerState(ContainerManagerImpl.java:302)
    at org.apache.hadoop.hdds.scm.container.AbstractContainerReportHandler.updateContainerState(AbstractContainerReportHandler.java:264)
    at org.apache.hadoop.hdds.scm.container.AbstractContainerReportHandler.processContainerReplica(AbstractContainerReportHandler.java:121)
    at org.apache.hadoop.hdds.scm.container.ContainerReportHandler.processSingleReplica(ContainerReportHandler.java:247)
    at org.apache.hadoop.hdds.scm.container.ContainerReportHandler.onMessage(ContainerReportHandler.java:195)
    at org.apache.hadoop.hdds.scm.container.ContainerReportHandler.onMessage(ContainerReportHandler.java:50)
    at org.apache.hadoop.hdds.server.events.FixedThreadPoolWithAffinityExecutor$ContainerReportProcessTask.run(FixedThreadPoolWithAffinityExecutor.java:282)
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:572)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:317)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
    at java.base/java.lang.Thread.run(Thread.java:1583)
Caused by: org.apache.ratis.protocol.exceptions.NotLeaderException: Server
2059f536-5846-4573-b81d-274dc495c727@group-C0BCE64451CF is not the leader,
suggested leader is: 79f352d3-c493-4176-904c-09a3d9ba0bc4|xxxx:9894
    at org.apache.ratis.server.impl.RaftServerImpl.generateNotLeaderException(RaftServerImpl.java:790)
    at org.apache.ratis.server.impl.RaftServerImpl.checkLeaderState(RaftServerImpl.java:755)
//...
```
h2. Root Cause
After a Follower SCM restarts, it starts accepting DataNode container reports
before it has fully caught up with the Ratis log, so containers whose state
changed during the restart remain stale in the Follower's DB.
{{AbstractContainerReportHandler#processContainerReplica}} detects the state
mismatch and calls {{updateContainerState}} via Ratis, which throws
{{NotLeaderException}} on the Follower. The exception propagates and skips the
subsequent {{{}updateContainerReplica{}}}, so the container's replica locations
are never recorded. When this Follower is later promoted to Leader, reads of
these containers fail with {{{}NO_REPLICA_FOUND{}}}.
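The failure path can be sketched with a minimal, self-contained model (all class and field names below are hypothetical simplifications for illustration, not the real Ozone APIs):

```java
import java.util.*;

// Simplified model of AbstractContainerReportHandler#processContainerReplica:
// the state update is a Ratis write that only the leader may perform, and the
// exception it throws on a follower also skips the in-memory replica update.
public class ReplicaLossSketch {
    static class NotLeaderException extends RuntimeException {}

    final boolean isLeader;
    final Map<Long, String> containerStates = new HashMap<>();       // replicated via Ratis
    final Map<Long, Set<String>> replicaLocations = new HashMap<>(); // in-memory only

    ReplicaLossSketch(boolean isLeader) { this.isLeader = isLeader; }

    void updateContainerState(long id, String newState) {
        if (!isLeader) {
            throw new NotLeaderException(); // Ratis write rejected on a follower
        }
        containerStates.put(id, newState);
    }

    void updateContainerReplica(long id, String datanode) {
        replicaLocations.computeIfAbsent(id, k -> new HashSet<>()).add(datanode);
    }

    // Mirrors the real handler: state update first, replica update second.
    void processContainerReplica(long id, String reportedState, String datanode) {
        if (!reportedState.equals(containerStates.get(id))) {
            updateContainerState(id, reportedState); // throws on follower -> next line never runs
        }
        updateContainerReplica(id, datanode);
    }

    public static void main(String[] args) {
        ReplicaLossSketch follower = new ReplicaLossSketch(false);
        follower.containerStates.put(17133024L, "OPEN"); // stale: the DN now reports CLOSED
        try {
            follower.processContainerReplica(17133024L, "CLOSED", "dn-1");
        } catch (NotLeaderException e) {
            // logged and swallowed by the report handler in the real code path
        }
        // The replica location was never recorded: NO_REPLICA_FOUND after promotion.
        System.out.println(follower.replicaLocations.containsKey(17133024L)); // prints "false"
    }
}
```

Containers whose stale state happens to match the report never enter the throwing branch, which is why only recently-changed containers lose their replicas.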
h3. Details
{{processContainerReplica}} calls {{updateContainerState}} then
{{updateContainerReplica}} sequentially. Only the Leader can execute
{{updateContainerState}} (a Ratis write); on a Follower, it throws
{{{}NotLeaderException{}}}.
Under normal operation this is harmless — most containers have consistent
state, so {{updateContainerState}} is a no-op and {{updateContainerReplica}}
succeeds. The few {{NotLeaderException}}s seen in
{{IncrementalContainerReportHandler}} logs come from in-flight state
transitions and resolve once the container reaches its final state.
The problem arises on Follower restart: the DN protocol server starts before
Ratis log replay completes, so many recently-changed containers have stale
state in the DB. Full container report (FCR) processing then triggers a burst
of {{{}NotLeaderException{}}}, each one skipping {{updateContainerReplica}} for
its container. Since replica locations are kept in memory only, these
containers lose all replica info until the next successful FCR cycle.
h3. Fix
* On SCM startup, start the DatanodeProtocolServer (which receives reports from the Datanodes) only after the Follower has caught up to the leader's committed log entries.
** Previously it was only guaranteed that the term of the Follower's log entries matched the leader's; matching terms alone do not guarantee that the Follower's log entries are up to date.
* Only allow the leader SCM to update container state via Ratis by executing {{updateContainerState}}.
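The first bullet amounts to a catch-up gate before server startup. A minimal sketch (helper and interface names here are hypothetical; the real change lives in SCM's startup sequence and Ratis integration):

```java
// Block DatanodeProtocolServer startup until this node's last *applied* Ratis
// index reaches the leader's *committed* index. Comparing terms alone is not
// enough: a follower can share the leader's term while still missing entries.
public class CatchUpGate {
    interface RatisView {
        long leaderCommitIndex(); // committed index advertised by the leader
        long lastAppliedIndex();  // entries this node has applied to its state machine
    }

    // Poll until applied >= leader commit, or give up at the deadline.
    static boolean awaitCatchUp(RatisView ratis, long timeoutMillis, long pollMillis)
            throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (System.currentTimeMillis() < deadline) {
            if (ratis.lastAppliedIndex() >= ratis.leaderCommitIndex()) {
                return true; // safe to start the DatanodeProtocolServer now
            }
            Thread.sleep(pollMillis);
        }
        return false;
    }

    public static void main(String[] args) throws InterruptedException {
        // Toy view: the node applies one more entry per poll; leader committed index is 5.
        long[] applied = {0};
        RatisView view = new RatisView() {
            public long leaderCommitIndex() { return 5; }
            public long lastAppliedIndex() { return applied[0]++; }
        };
        System.out.println(awaitCatchUp(view, 1000, 1)); // prints "true"
        // Only after this returns true would the DN protocol server be started.
    }
}
```

Reports that arrive only after this gate opens are evaluated against up-to-date container state, so the {{NotLeaderException}} burst never happens.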
> Delay follower SCM DN server start until Ratis log catch-up
> -----------------------------------------------------------
>
> Key: HDDS-14989
> URL: https://issues.apache.org/jira/browse/HDDS-14989
> Project: Apache Ozone
> Issue Type: Bug
> Reporter: ChenXi
> Assignee: ChenXi
> Priority: Major
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]