[
https://issues.apache.org/jira/browse/RATIS-2283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Xinhao GU updated RATIS-2283:
-----------------------------
Description:
h3. *Description*
After the {{GrpcLogAppender}} thread restarts, the {{catchup}} flag in the
corresponding {{followerInfo}} remains {{{}false{}}}. This causes a critical
issue when performing a membership change (e.g., adding a new peer).
Specifically, {{checkProgress}} cannot complete, leading to indefinite blocking
during the reconfiguration phase.
h3. Related Log
{code:java}
2025-03-14 17:57:43,342 [grpc-default-executor-41] WARN
o.a.r.g.s.GrpcAdminProtocolService:79 - Failed to setConfiguration:
SetConfigurationRequest:client-7BB03FC6049C->5@group-000200000014, cid=15,
seq=null, RW, null, SET_UNCONDITIONALLY, servers:[3|192.168.10.72:10750,
4|192.168.10.73:10750, 5|192.168.10.74:10750, 7|192.168.10.76:10750],
listeners:[]
java.util.concurrent.CompletionException:
org.apache.ratis.protocol.exceptions.ReconfigurationInProgressException:
Reconfiguration is already in progress: conf: {index: 1,
cur=peers:[3|192.168.10.72:10750, 4|192.168.10.73:10750,
5|192.168.10.74:10750]|listeners:[], old=null}
at org.apache.ratis.util.JavaUtils.callAsUnchecked(JavaUtils.java:122)
at
org.apache.ratis.server.impl.RaftServerImpl.lambda$executeSubmitServerRequestAsync$10(RaftServerImpl.java:902)
at
java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1768)
at
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
at
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
at java.base/java.lang.Thread.run(Thread.java:833)
Caused by:
org.apache.ratis.protocol.exceptions.ReconfigurationInProgressException:
Reconfiguration is already in progress: conf: {index: 1,
cur=peers:[3|192.168.10.72:10750, 4|192.168.10.73:10750,
5|192.168.10.74:10750]|listeners:[], old=null}
at
org.apache.ratis.server.impl.RaftServerImpl.setConfigurationAsync(RaftServerImpl.java:1320)
at
org.apache.ratis.server.impl.RaftServerProxy.lambda$null$24(RaftServerProxy.java:623)
at org.apache.ratis.util.JavaUtils.callAsUnchecked(JavaUtils.java:118)
... 5 common frames omitted {code}
h3. Cause Analysis
# *Direct Cause:*
When the {{GrpcLogAppender}} restarts, its {{catchup}} flag stays
{{{}false{}}}, and {{hasAttemptedToInstallSnapshot}} is unsatisfied. This
prevents the {{stagingState}} phase of reconfiguration from completing.
# *Code-Level Issue:*
In {{{}shouldInstallSnapshot(){}}}, the current logic only allows new followers
in the {{stagingState}} to trigger snapshot installation. Restarted _old
followers_ (e.g., after {{GrpcLogAppender}} restart) are excluded, causing
their {{catchup}} flag to remain {{false}} indefinitely.
h3. Related Code
{code:java}
private BootStrapProgress checkProgress(FollowerInfo follower, long committed) {
long confIndex = server.getRaftConf().getLogEntryIndex();
long matchIndex = follower.getMatchIndex();
LOG.info("{}: checkProgress for {}, commitIndex={}, conf'LogEntryIndex={}",
this, follower, committed, confIndex);
Preconditions.assertTrue(!isCaughtUp(follower));
final Timestamp progressTime =
Timestamp.currentTime().addTimeMs(-server.getMaxTimeoutMs());
final Timestamp timeoutTime =
Timestamp.currentTime().addTimeMs(-stagingTimeout.toLong(TimeUnit.MILLISECONDS));
if (follower.getLastRpcResponseTime().compareTo(timeoutTime) < 0) {
LOG.info("{} detects a follower {} timeout ({}ms) for bootstrapping", this,
follower,
follower.getLastRpcResponseTime().elapsedTimeMs());
return BootStrapProgress.NOPROGRESS;
} else if (matchIndex + stagingCatchupGap > committed
&& matchIndex >= confIndex
&& follower.getLastRpcResponseTime().compareTo(progressTime) > 0
&& follower.hasAttemptedToInstallSnapshot()) {
LOG.info("{} detects a follower {} is caught up for bootstrapping,
follower's matchIndex={}, Attempted={}", this, follower, matchIndex,
follower.hasAttemptedToInstallSnapshot());
return BootStrapProgress.CAUGHTUP;
} else {
LOG.info("{} detects a follower {} is making progress for bootstrapping,
follower's matchIndex={}, Attempted={}", this, follower, matchIndex,
follower.hasAttemptedToInstallSnapshot());
return BootStrapProgress.PROGRESSING;
}
} {code}
h3. *Proposed Fix:*
{*}Adjust {{shouldInstallSnapshot()}} Logic{*}: Allow snapshot installation for
both _new followers_ in {{stagingState}} and {_}restarted old followers{_}.
!image-2025-04-22-15-57-53-633.png|width=849,height=264!
h3. Pull Request
https://github.com/apache/ratis/pull/1250
was:
h3. *Description*
After the {{GrpcLogAppender}} thread restarts, the {{catchup}} flag in the
corresponding {{followerInfo}} remains {{{}false{}}}. This causes a critical
issue when performing a membership change (e.g., adding a new peer).
Specifically, {{checkProgress}} cannot complete, leading to indefinite blocking
during the reconfiguration phase.
h3. Related Log
{code:java}
2025-03-14 17:57:43,342 [grpc-default-executor-41] WARN
o.a.r.g.s.GrpcAdminProtocolService:79 - Failed to setConfiguration:
SetConfigurationRequest:client-7BB03FC6049C->5@group-000200000014, cid=15,
seq=null, RW, null, SET_UNCONDITIONALLY, servers:[3|192.168.10.72:10750,
4|192.168.10.73:10750, 5|192.168.10.74:10750, 7|192.168.10.76:10750],
listeners:[]
java.util.concurrent.CompletionException:
org.apache.ratis.protocol.exceptions.ReconfigurationInProgressException:
Reconfiguration is already in progress: conf: {index: 1,
cur=peers:[3|192.168.10.72:10750, 4|192.168.10.73:10750,
5|192.168.10.74:10750]|listeners:[], old=null}
at org.apache.ratis.util.JavaUtils.callAsUnchecked(JavaUtils.java:122)
at
org.apache.ratis.server.impl.RaftServerImpl.lambda$executeSubmitServerRequestAsync$10(RaftServerImpl.java:902)
at
java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1768)
at
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
at
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
at java.base/java.lang.Thread.run(Thread.java:833)
Caused by:
org.apache.ratis.protocol.exceptions.ReconfigurationInProgressException:
Reconfiguration is already in progress: conf: {index: 1,
cur=peers:[3|192.168.10.72:10750, 4|192.168.10.73:10750,
5|192.168.10.74:10750]|listeners:[], old=null}
at
org.apache.ratis.server.impl.RaftServerImpl.setConfigurationAsync(RaftServerImpl.java:1320)
at
org.apache.ratis.server.impl.RaftServerProxy.lambda$null$24(RaftServerProxy.java:623)
at org.apache.ratis.util.JavaUtils.callAsUnchecked(JavaUtils.java:118)
... 5 common frames omitted {code}
h3. Cause Analysis
# *Direct Cause:*
When the {{GrpcLogAppender}} restarts, its {{catchup}} flag stays
{{{}false{}}}, and {{hasAttemptedToInstallSnapshot}} is unsatisfied. This
prevents the {{stagingState}} phase of reconfiguration from completing.
# *Code-Level Issue:*
In {{{}shouldInstallSnapshot(){}}}, the current logic only allows new followers
in the {{stagingState}} to trigger snapshot installation. Restarted _old
followers_ (e.g., after {{GrpcLogAppender}} restart) are excluded, causing
their {{catchup}} flag to remain {{false}} indefinitely.
h3. Related Code
{code:java}
private BootStrapProgress checkProgress(FollowerInfo follower, long committed) {
long confIndex = server.getRaftConf().getLogEntryIndex();
long matchIndex = follower.getMatchIndex();
LOG.info("{}: checkProgress for {}, commitIndex={}, conf'LogEntryIndex={}",
this, follower, committed, confIndex);
Preconditions.assertTrue(!isCaughtUp(follower));
final Timestamp progressTime =
Timestamp.currentTime().addTimeMs(-server.getMaxTimeoutMs());
final Timestamp timeoutTime =
Timestamp.currentTime().addTimeMs(-stagingTimeout.toLong(TimeUnit.MILLISECONDS));
if (follower.getLastRpcResponseTime().compareTo(timeoutTime) < 0) {
LOG.info("{} detects a follower {} timeout ({}ms) for bootstrapping", this,
follower,
follower.getLastRpcResponseTime().elapsedTimeMs());
return BootStrapProgress.NOPROGRESS;
} else if (matchIndex + stagingCatchupGap > committed
&& matchIndex >= confIndex
&& follower.getLastRpcResponseTime().compareTo(progressTime) > 0
&& follower.hasAttemptedToInstallSnapshot()) {
LOG.info("{} detects a follower {} is caught up for bootstrapping,
follower's matchIndex={}, Attempted={}", this, follower, matchIndex,
follower.hasAttemptedToInstallSnapshot());
return BootStrapProgress.CAUGHTUP;
} else {
LOG.info("{} detects a follower {} is making progress for bootstrapping,
follower's matchIndex={}, Attempted={}", this, follower, matchIndex,
follower.hasAttemptedToInstallSnapshot());
return BootStrapProgress.PROGRESSING;
}
} {code}
h3. *Proposed Fix:*
{*}Adjust {{shouldInstallSnapshot()}} Logic{*}: Allow snapshot installation for
both _new followers_ in {{stagingState}} and {_}restarted old followers{_}.
!image-2025-04-22-15-57-53-633.png|width=849,height=264!
> GrpcLogAppender Thread Restart Leaves catchup=false, Blocking Reconfiguration
> Progress
> --------------------------------------------------------------------------------------
>
> Key: RATIS-2283
> URL: https://issues.apache.org/jira/browse/RATIS-2283
> Project: Ratis
> Issue Type: Bug
> Reporter: Xinhao GU
> Assignee: Xinhao GU
> Priority: Major
> Attachments: image-2025-04-22-15-57-53-633.png
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> h3. *Description*
> After the {{GrpcLogAppender}} thread restarts, the {{catchup}} flag in the
> corresponding {{followerInfo}} remains {{{}false{}}}. This causes a critical
> issue when performing a membership change (e.g., adding a new peer).
> Specifically, {{checkProgress}} cannot complete, leading to indefinite
> blocking during the reconfiguration phase.
> h3. Related Log
> {code:java}
> 2025-03-14 17:57:43,342 [grpc-default-executor-41] WARN
> o.a.r.g.s.GrpcAdminProtocolService:79 - Failed to setConfiguration:
> SetConfigurationRequest:client-7BB03FC6049C->5@group-000200000014, cid=15,
> seq=null, RW, null, SET_UNCONDITIONALLY, servers:[3|192.168.10.72:10750,
> 4|192.168.10.73:10750, 5|192.168.10.74:10750, 7|192.168.10.76:10750],
> listeners:[]
> java.util.concurrent.CompletionException:
> org.apache.ratis.protocol.exceptions.ReconfigurationInProgressException:
> Reconfiguration is already in progress: conf: {index: 1,
> cur=peers:[3|192.168.10.72:10750, 4|192.168.10.73:10750,
> 5|192.168.10.74:10750]|listeners:[], old=null}
> at org.apache.ratis.util.JavaUtils.callAsUnchecked(JavaUtils.java:122)
> at
> org.apache.ratis.server.impl.RaftServerImpl.lambda$executeSubmitServerRequestAsync$10(RaftServerImpl.java:902)
> at
> java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1768)
> at
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
> at
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
> at java.base/java.lang.Thread.run(Thread.java:833)
> Caused by:
> org.apache.ratis.protocol.exceptions.ReconfigurationInProgressException:
> Reconfiguration is already in progress: conf: {index: 1,
> cur=peers:[3|192.168.10.72:10750, 4|192.168.10.73:10750,
> 5|192.168.10.74:10750]|listeners:[], old=null}
> at
> org.apache.ratis.server.impl.RaftServerImpl.setConfigurationAsync(RaftServerImpl.java:1320)
> at
> org.apache.ratis.server.impl.RaftServerProxy.lambda$null$24(RaftServerProxy.java:623)
> at org.apache.ratis.util.JavaUtils.callAsUnchecked(JavaUtils.java:118)
> ... 5 common frames omitted {code}
> h3. Cause Analysis
> # *Direct Cause:*
> When the {{GrpcLogAppender}} restarts, its {{catchup}} flag stays
> {{{}false{}}}, and {{hasAttemptedToInstallSnapshot}} is unsatisfied. This
> prevents the {{stagingState}} phase of reconfiguration from completing.
> # *Code-Level Issue:*
> In {{{}shouldInstallSnapshot(){}}}, the current logic only allows new
> followers in the {{stagingState}} to trigger snapshot installation. Restarted
> _old followers_ (e.g., after {{GrpcLogAppender}} restart) are excluded,
> causing their {{catchup}} flag to remain {{false}} indefinitely.
> h3. Related Code
> {code:java}
> private BootStrapProgress checkProgress(FollowerInfo follower, long
> committed) {
> long confIndex = server.getRaftConf().getLogEntryIndex();
> long matchIndex = follower.getMatchIndex();
> LOG.info("{}: checkProgress for {}, commitIndex={}, conf'LogEntryIndex={}",
> this, follower, committed, confIndex);
> Preconditions.assertTrue(!isCaughtUp(follower));
> final Timestamp progressTime =
> Timestamp.currentTime().addTimeMs(-server.getMaxTimeoutMs());
> final Timestamp timeoutTime =
> Timestamp.currentTime().addTimeMs(-stagingTimeout.toLong(TimeUnit.MILLISECONDS));
> if (follower.getLastRpcResponseTime().compareTo(timeoutTime) < 0) {
> LOG.info("{} detects a follower {} timeout ({}ms) for bootstrapping",
> this, follower,
> follower.getLastRpcResponseTime().elapsedTimeMs());
> return BootStrapProgress.NOPROGRESS;
> } else if (matchIndex + stagingCatchupGap > committed
> && matchIndex >= confIndex
> && follower.getLastRpcResponseTime().compareTo(progressTime) > 0
> && follower.hasAttemptedToInstallSnapshot()) {
> LOG.info("{} detects a follower {} is caught up for bootstrapping,
> follower's matchIndex={}, Attempted={}", this, follower, matchIndex,
> follower.hasAttemptedToInstallSnapshot());
> return BootStrapProgress.CAUGHTUP;
> } else {
> LOG.info("{} detects a follower {} is making progress for bootstrapping,
> follower's matchIndex={}, Attempted={}", this, follower, matchIndex,
> follower.hasAttemptedToInstallSnapshot());
> return BootStrapProgress.PROGRESSING;
> }
> } {code}
> h3. *Proposed Fix:*
> {*}Adjust {{shouldInstallSnapshot()}} Logic{*}: Allow snapshot installation
> for both _new followers_ in {{stagingState}} and {_}restarted old
> followers{_}.
> !image-2025-04-22-15-57-53-633.png|width=849,height=264!
> h3. Pull Request
> https://github.com/apache/ratis/pull/1250
--
This message was sent by Atlassian Jira
(v8.20.10#820010)