[ 
https://issues.apache.org/jira/browse/RATIS-2283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz-wo Sze resolved RATIS-2283.
-------------------------------
    Fix Version/s: 3.2.0
       Resolution: Fixed

The pull request is now merged.  Thanks, [~gewu12207] !

> GrpcLogAppender Thread Restart Leaves catchup=false, Blocking Reconfiguration 
> Progress
> --------------------------------------------------------------------------------------
>
>                 Key: RATIS-2283
>                 URL: https://issues.apache.org/jira/browse/RATIS-2283
>             Project: Ratis
>          Issue Type: Bug
>          Components: Leader
>            Reporter: Xinhao Gu
>            Assignee: Xinhao Gu
>            Priority: Major
>             Fix For: 3.2.0
>
>         Attachments: image-2025-04-22-15-57-53-633.png
>
>          Time Spent: 50m
>  Remaining Estimate: 0h
>
> h3. *Description*
> After the {{GrpcLogAppender}} thread restarts, the {{catchup}} flag in the 
> corresponding {{followerInfo}} remains {{{}false{}}}. This causes a critical 
> issue when performing a membership change (e.g., adding a new peer). 
> Specifically, {{checkProgress}} cannot complete, leading to indefinite 
> blocking during the reconfiguration phase.
> h3. Related Log 
> {code:java}
> 2025-03-14 17:57:43,342 [grpc-default-executor-41] WARN  
> o.a.r.g.s.GrpcAdminProtocolService:79 - Failed to setConfiguration: 
> SetConfigurationRequest:client-7BB03FC6049C->5@group-000200000014, cid=15, 
> seq=null, RW, null, SET_UNCONDITIONALLY, servers:[3|192.168.10.72:10750, 
> 4|192.168.10.73:10750, 5|192.168.10.74:10750, 7|192.168.10.76:10750], 
> listeners:[] 
> java.util.concurrent.CompletionException: 
> org.apache.ratis.protocol.exceptions.ReconfigurationInProgressException: 
> Reconfiguration is already in progress: conf: {index: 1, 
> cur=peers:[3|192.168.10.72:10750, 4|192.168.10.73:10750, 
> 5|192.168.10.74:10750]|listeners:[], old=null}
>     at org.apache.ratis.util.JavaUtils.callAsUnchecked(JavaUtils.java:122)
>     at 
> org.apache.ratis.server.impl.RaftServerImpl.lambda$executeSubmitServerRequestAsync$10(RaftServerImpl.java:902)
>     at 
> java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1768)
>     at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
>     at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
>     at java.base/java.lang.Thread.run(Thread.java:833)
> Caused by: 
> org.apache.ratis.protocol.exceptions.ReconfigurationInProgressException: 
> Reconfiguration is already in progress: conf: {index: 1, 
> cur=peers:[3|192.168.10.72:10750, 4|192.168.10.73:10750, 
> 5|192.168.10.74:10750]|listeners:[], old=null}
>     at 
> org.apache.ratis.server.impl.RaftServerImpl.setConfigurationAsync(RaftServerImpl.java:1320)
>     at 
> org.apache.ratis.server.impl.RaftServerProxy.lambda$null$24(RaftServerProxy.java:623)
>     at org.apache.ratis.util.JavaUtils.callAsUnchecked(JavaUtils.java:118)
>     ... 5 common frames omitted {code}
> h3. Cause Analysis
>  # *Direct Cause:*
> When the {{GrpcLogAppender}} restarts, its {{catchup}} flag stays 
> {{{}false{}}}, and {{hasAttemptedToInstallSnapshot}} is unsatisfied. This 
> prevents the {{stagingState}} phase of reconfiguration from completing.
>  # *Code-Level Issue:*
> In {{{}shouldInstallSnapshot(){}}}, the current logic only allows new 
> followers in the {{stagingState}} to trigger snapshot installation. Restarted 
> _old followers_ (e.g., after {{GrpcLogAppender}} restart) are excluded, 
> causing their {{catchup}} flag to remain {{false}} indefinitely.
> h3. Related Code
> {code:java}
> private BootStrapProgress checkProgress(FollowerInfo follower, long 
> committed) {
>   long confIndex = server.getRaftConf().getLogEntryIndex();
>   long matchIndex = follower.getMatchIndex();
>   LOG.info("{}: checkProgress for {}, commitIndex={}, conf'LogEntryIndex={}", 
> this, follower, committed, confIndex);
>   Preconditions.assertTrue(!isCaughtUp(follower));
>   final Timestamp progressTime = 
> Timestamp.currentTime().addTimeMs(-server.getMaxTimeoutMs());
>   final Timestamp timeoutTime = 
> Timestamp.currentTime().addTimeMs(-stagingTimeout.toLong(TimeUnit.MILLISECONDS));
>   if (follower.getLastRpcResponseTime().compareTo(timeoutTime) < 0) {
>     LOG.info("{} detects a follower {} timeout ({}ms) for bootstrapping", 
> this, follower,
>         follower.getLastRpcResponseTime().elapsedTimeMs());
>     return BootStrapProgress.NOPROGRESS;
>   } else if (matchIndex + stagingCatchupGap > committed
>       && matchIndex >= confIndex
>       && follower.getLastRpcResponseTime().compareTo(progressTime) > 0
>       && follower.hasAttemptedToInstallSnapshot()) {
>     LOG.info("{} detects a follower {} is caught up for bootstrapping, 
> follower's matchIndex={}, Attempted={}", this, follower, matchIndex, 
> follower.hasAttemptedToInstallSnapshot());
>     return BootStrapProgress.CAUGHTUP;
>   } else {
>     LOG.info("{} detects a follower {} is making progress for bootstrapping, 
> follower's matchIndex={}, Attempted={}", this, follower, matchIndex, 
> follower.hasAttemptedToInstallSnapshot());
>     return BootStrapProgress.PROGRESSING;
>   }
> } {code}
> h3. *Proposed Fix:*
> {*}Adjust {{shouldInstallSnapshot()}} Logic{*}: Allow snapshot installation 
> for both _new followers_ in {{stagingState}} and {_}restarted old 
> followers{_}.
> !image-2025-04-22-15-57-53-633.png|width=849,height=264!
> h3. Pull Request
> https://github.com/apache/ratis/pull/1250



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to