[ 
https://issues.apache.org/jira/browse/RATIS-2283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinhao GU updated RATIS-2283:
-----------------------------
    Description: 
h3. *Description*

After the {{GrpcLogAppender}} thread restarts, the {{catchup}} flag in the 
corresponding {{followerInfo}} remains {{{}false{}}}. This causes a critical 
issue when performing a membership change (e.g., adding a new peer). 
Specifically, {{checkProgress}} cannot complete, leading to indefinite blocking 
during the reconfiguration phase.
h3. Related Log 
{code:java}
2025-03-14 17:57:43,342 [grpc-default-executor-41] WARN  
o.a.r.g.s.GrpcAdminProtocolService:79 - Failed to setConfiguration: 
SetConfigurationRequest:client-7BB03FC6049C->5@group-000200000014, cid=15, 
seq=null, RW, null, SET_UNCONDITIONALLY, servers:[3|192.168.10.72:10750, 
4|192.168.10.73:10750, 5|192.168.10.74:10750, 7|192.168.10.76:10750], 
listeners:[] 
java.util.concurrent.CompletionException: 
org.apache.ratis.protocol.exceptions.ReconfigurationInProgressException: 
Reconfiguration is already in progress: conf: {index: 1, 
cur=peers:[3|192.168.10.72:10750, 4|192.168.10.73:10750, 
5|192.168.10.74:10750]|listeners:[], old=null}
    at org.apache.ratis.util.JavaUtils.callAsUnchecked(JavaUtils.java:122)
    at 
org.apache.ratis.server.impl.RaftServerImpl.lambda$executeSubmitServerRequestAsync$10(RaftServerImpl.java:902)
    at 
java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1768)
    at 
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
    at 
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at java.base/java.lang.Thread.run(Thread.java:833)
Caused by: 
org.apache.ratis.protocol.exceptions.ReconfigurationInProgressException: 
Reconfiguration is already in progress: conf: {index: 1, 
cur=peers:[3|192.168.10.72:10750, 4|192.168.10.73:10750, 
5|192.168.10.74:10750]|listeners:[], old=null}
    at 
org.apache.ratis.server.impl.RaftServerImpl.setConfigurationAsync(RaftServerImpl.java:1320)
    at 
org.apache.ratis.server.impl.RaftServerProxy.lambda$null$24(RaftServerProxy.java:623)
    at org.apache.ratis.util.JavaUtils.callAsUnchecked(JavaUtils.java:118)
    ... 5 common frames omitted {code}
h3. Cause Analysis
 # *Direct Cause:*
When the {{GrpcLogAppender}} restarts, its {{catchup}} flag stays 
{{{}false{}}}, and {{hasAttemptedToInstallSnapshot}} is unsatisfied. This 
prevents the {{stagingState}} phase of reconfiguration from completing.

 # *Code-Level Issue:*
In {{{}shouldInstallSnapshot(){}}}, the current logic only allows new followers 
in the {{stagingState}} to trigger snapshot installation. Restarted _old 
followers_ (e.g., after {{GrpcLogAppender}} restart) are excluded, causing 
their {{catchup}} flag to remain {{false}} indefinitely.

h3. Related Code
{code:java}
private BootStrapProgress checkProgress(FollowerInfo follower, long committed) {
  long confIndex = server.getRaftConf().getLogEntryIndex();
  long matchIndex = follower.getMatchIndex();
  LOG.info("{}: checkProgress for {}, commitIndex={}, conf'LogEntryIndex={}", 
this, follower, committed, confIndex);
  Preconditions.assertTrue(!isCaughtUp(follower));
  final Timestamp progressTime = 
Timestamp.currentTime().addTimeMs(-server.getMaxTimeoutMs());
  final Timestamp timeoutTime = 
Timestamp.currentTime().addTimeMs(-stagingTimeout.toLong(TimeUnit.MILLISECONDS));
  if (follower.getLastRpcResponseTime().compareTo(timeoutTime) < 0) {
    LOG.info("{} detects a follower {} timeout ({}ms) for bootstrapping", this, 
follower,
        follower.getLastRpcResponseTime().elapsedTimeMs());
    return BootStrapProgress.NOPROGRESS;
  } else if (matchIndex + stagingCatchupGap > committed
      && matchIndex >= confIndex
      && follower.getLastRpcResponseTime().compareTo(progressTime) > 0
      && follower.hasAttemptedToInstallSnapshot()) {
    LOG.info("{} detects a follower {} is caught up for bootstrapping, 
follower's matchIndex={}, Attempted={}", this, follower, matchIndex, 
follower.hasAttemptedToInstallSnapshot());
    return BootStrapProgress.CAUGHTUP;
  } else {
    LOG.info("{} detects a follower {} is making progress for bootstrapping, 
follower's matchIndex={}, Attempted={}", this, follower, matchIndex, 
follower.hasAttemptedToInstallSnapshot());
    return BootStrapProgress.PROGRESSING;
  }
} {code}
h3. *Proposed Fix:*

{*}Adjust {{shouldInstallSnapshot()}} Logic{*}: Allow snapshot installation for 
both _new followers_ in {{stagingState}} and {_}restarted old followers{_}.

!image-2025-04-22-15-57-53-633.png|width=849,height=264!
h3. Pull Request

https://github.com/apache/ratis/pull/1250

  was:
h3. *Description*

After the {{GrpcLogAppender}} thread restarts, the {{catchup}} flag in the 
corresponding {{followerInfo}} remains {{{}false{}}}. This causes a critical 
issue when performing a membership change (e.g., adding a new peer). 
Specifically, {{checkProgress}} cannot complete, leading to indefinite blocking 
during the reconfiguration phase.
h3. Related Log 
{code:java}
2025-03-14 17:57:43,342 [grpc-default-executor-41] WARN  
o.a.r.g.s.GrpcAdminProtocolService:79 - Failed to setConfiguration: 
SetConfigurationRequest:client-7BB03FC6049C->5@group-000200000014, cid=15, 
seq=null, RW, null, SET_UNCONDITIONALLY, servers:[3|192.168.10.72:10750, 
4|192.168.10.73:10750, 5|192.168.10.74:10750, 7|192.168.10.76:10750], 
listeners:[] 
java.util.concurrent.CompletionException: 
org.apache.ratis.protocol.exceptions.ReconfigurationInProgressException: 
Reconfiguration is already in progress: conf: {index: 1, 
cur=peers:[3|192.168.10.72:10750, 4|192.168.10.73:10750, 
5|192.168.10.74:10750]|listeners:[], old=null}
    at org.apache.ratis.util.JavaUtils.callAsUnchecked(JavaUtils.java:122)
    at 
org.apache.ratis.server.impl.RaftServerImpl.lambda$executeSubmitServerRequestAsync$10(RaftServerImpl.java:902)
    at 
java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1768)
    at 
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
    at 
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at java.base/java.lang.Thread.run(Thread.java:833)
Caused by: 
org.apache.ratis.protocol.exceptions.ReconfigurationInProgressException: 
Reconfiguration is already in progress: conf: {index: 1, 
cur=peers:[3|192.168.10.72:10750, 4|192.168.10.73:10750, 
5|192.168.10.74:10750]|listeners:[], old=null}
    at 
org.apache.ratis.server.impl.RaftServerImpl.setConfigurationAsync(RaftServerImpl.java:1320)
    at 
org.apache.ratis.server.impl.RaftServerProxy.lambda$null$24(RaftServerProxy.java:623)
    at org.apache.ratis.util.JavaUtils.callAsUnchecked(JavaUtils.java:118)
    ... 5 common frames omitted {code}
h3. Cause Analysis
 # *Direct Cause:*
When the {{GrpcLogAppender}} restarts, its {{catchup}} flag stays 
{{{}false{}}}, and {{hasAttemptedToInstallSnapshot}} is unsatisfied. This 
prevents the {{stagingState}} phase of reconfiguration from completing.

 # *Code-Level Issue:*
In {{{}shouldInstallSnapshot(){}}}, the current logic only allows new followers 
in the {{stagingState}} to trigger snapshot installation. Restarted _old 
followers_ (e.g., after {{GrpcLogAppender}} restart) are excluded, causing 
their {{catchup}} flag to remain {{false}} indefinitely.

h3. Related Code
{code:java}
private BootStrapProgress checkProgress(FollowerInfo follower, long committed) {
  long confIndex = server.getRaftConf().getLogEntryIndex();
  long matchIndex = follower.getMatchIndex();
  LOG.info("{}: checkProgress for {}, commitIndex={}, conf'LogEntryIndex={}", 
this, follower, committed, confIndex);
  Preconditions.assertTrue(!isCaughtUp(follower));
  final Timestamp progressTime = 
Timestamp.currentTime().addTimeMs(-server.getMaxTimeoutMs());
  final Timestamp timeoutTime = 
Timestamp.currentTime().addTimeMs(-stagingTimeout.toLong(TimeUnit.MILLISECONDS));
  if (follower.getLastRpcResponseTime().compareTo(timeoutTime) < 0) {
    LOG.info("{} detects a follower {} timeout ({}ms) for bootstrapping", this, 
follower,
        follower.getLastRpcResponseTime().elapsedTimeMs());
    return BootStrapProgress.NOPROGRESS;
  } else if (matchIndex + stagingCatchupGap > committed
      && matchIndex >= confIndex
      && follower.getLastRpcResponseTime().compareTo(progressTime) > 0
      && follower.hasAttemptedToInstallSnapshot()) {
    LOG.info("{} detects a follower {} is caught up for bootstrapping, 
follower's matchIndex={}, Attempted={}", this, follower, matchIndex, 
follower.hasAttemptedToInstallSnapshot());
    return BootStrapProgress.CAUGHTUP;
  } else {
    LOG.info("{} detects a follower {} is making progress for bootstrapping, 
follower's matchIndex={}, Attempted={}", this, follower, matchIndex, 
follower.hasAttemptedToInstallSnapshot());
    return BootStrapProgress.PROGRESSING;
  }
} {code}
h3. *Proposed Fix:*

{*}Adjust {{shouldInstallSnapshot()}} Logic{*}: Allow snapshot installation for 
both _new followers_ in {{stagingState}} and {_}restarted old followers{_}.

!image-2025-04-22-15-57-53-633.png|width=849,height=264!


> GrpcLogAppender Thread Restart Leaves catchup=false, Blocking Reconfiguration 
> Progress
> --------------------------------------------------------------------------------------
>
>                 Key: RATIS-2283
>                 URL: https://issues.apache.org/jira/browse/RATIS-2283
>             Project: Ratis
>          Issue Type: Bug
>            Reporter: Xinhao GU
>            Assignee: Xinhao GU
>            Priority: Major
>         Attachments: image-2025-04-22-15-57-53-633.png
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> h3. *Description*
> After the {{GrpcLogAppender}} thread restarts, the {{catchup}} flag in the 
> corresponding {{followerInfo}} remains {{{}false{}}}. This causes a critical 
> issue when performing a membership change (e.g., adding a new peer). 
> Specifically, {{checkProgress}} cannot complete, leading to indefinite 
> blocking during the reconfiguration phase.
> h3. Related Log 
> {code:java}
> 2025-03-14 17:57:43,342 [grpc-default-executor-41] WARN  
> o.a.r.g.s.GrpcAdminProtocolService:79 - Failed to setConfiguration: 
> SetConfigurationRequest:client-7BB03FC6049C->5@group-000200000014, cid=15, 
> seq=null, RW, null, SET_UNCONDITIONALLY, servers:[3|192.168.10.72:10750, 
> 4|192.168.10.73:10750, 5|192.168.10.74:10750, 7|192.168.10.76:10750], 
> listeners:[] 
> java.util.concurrent.CompletionException: 
> org.apache.ratis.protocol.exceptions.ReconfigurationInProgressException: 
> Reconfiguration is already in progress: conf: {index: 1, 
> cur=peers:[3|192.168.10.72:10750, 4|192.168.10.73:10750, 
> 5|192.168.10.74:10750]|listeners:[], old=null}
>     at org.apache.ratis.util.JavaUtils.callAsUnchecked(JavaUtils.java:122)
>     at 
> org.apache.ratis.server.impl.RaftServerImpl.lambda$executeSubmitServerRequestAsync$10(RaftServerImpl.java:902)
>     at 
> java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1768)
>     at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
>     at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
>     at java.base/java.lang.Thread.run(Thread.java:833)
> Caused by: 
> org.apache.ratis.protocol.exceptions.ReconfigurationInProgressException: 
> Reconfiguration is already in progress: conf: {index: 1, 
> cur=peers:[3|192.168.10.72:10750, 4|192.168.10.73:10750, 
> 5|192.168.10.74:10750]|listeners:[], old=null}
>     at 
> org.apache.ratis.server.impl.RaftServerImpl.setConfigurationAsync(RaftServerImpl.java:1320)
>     at 
> org.apache.ratis.server.impl.RaftServerProxy.lambda$null$24(RaftServerProxy.java:623)
>     at org.apache.ratis.util.JavaUtils.callAsUnchecked(JavaUtils.java:118)
>     ... 5 common frames omitted {code}
> h3. Cause Analysis
>  # *Direct Cause:*
> When the {{GrpcLogAppender}} restarts, its {{catchup}} flag stays 
> {{{}false{}}}, and {{hasAttemptedToInstallSnapshot}} is unsatisfied. This 
> prevents the {{stagingState}} phase of reconfiguration from completing.
>  # *Code-Level Issue:*
> In {{{}shouldInstallSnapshot(){}}}, the current logic only allows new 
> followers in the {{stagingState}} to trigger snapshot installation. Restarted 
> _old followers_ (e.g., after {{GrpcLogAppender}} restart) are excluded, 
> causing their {{catchup}} flag to remain {{false}} indefinitely.
> h3. Related Code
> {code:java}
> private BootStrapProgress checkProgress(FollowerInfo follower, long 
> committed) {
>   long confIndex = server.getRaftConf().getLogEntryIndex();
>   long matchIndex = follower.getMatchIndex();
>   LOG.info("{}: checkProgress for {}, commitIndex={}, conf'LogEntryIndex={}", 
> this, follower, committed, confIndex);
>   Preconditions.assertTrue(!isCaughtUp(follower));
>   final Timestamp progressTime = 
> Timestamp.currentTime().addTimeMs(-server.getMaxTimeoutMs());
>   final Timestamp timeoutTime = 
> Timestamp.currentTime().addTimeMs(-stagingTimeout.toLong(TimeUnit.MILLISECONDS));
>   if (follower.getLastRpcResponseTime().compareTo(timeoutTime) < 0) {
>     LOG.info("{} detects a follower {} timeout ({}ms) for bootstrapping", 
> this, follower,
>         follower.getLastRpcResponseTime().elapsedTimeMs());
>     return BootStrapProgress.NOPROGRESS;
>   } else if (matchIndex + stagingCatchupGap > committed
>       && matchIndex >= confIndex
>       && follower.getLastRpcResponseTime().compareTo(progressTime) > 0
>       && follower.hasAttemptedToInstallSnapshot()) {
>     LOG.info("{} detects a follower {} is caught up for bootstrapping, 
> follower's matchIndex={}, Attempted={}", this, follower, matchIndex, 
> follower.hasAttemptedToInstallSnapshot());
>     return BootStrapProgress.CAUGHTUP;
>   } else {
>     LOG.info("{} detects a follower {} is making progress for bootstrapping, 
> follower's matchIndex={}, Attempted={}", this, follower, matchIndex, 
> follower.hasAttemptedToInstallSnapshot());
>     return BootStrapProgress.PROGRESSING;
>   }
> } {code}
> h3. *Proposed Fix:*
> {*}Adjust {{shouldInstallSnapshot()}} Logic{*}: Allow snapshot installation 
> for both _new followers_ in {{stagingState}} and {_}restarted old 
> followers{_}.
> !image-2025-04-22-15-57-53-633.png|width=849,height=264!
> h3. Pull Request
> https://github.com/apache/ratis/pull/1250



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to