SzyWilliam commented on PR #1091:
URL: https://github.com/apache/ratis/pull/1091#issuecomment-2109363882

   @szetszwo, thanks for reviewing this!
   
   The original `configurationManager` before the patch will produce another 
`IllegalStateException` as follows:
   ```java
   2024-05-14 13:46:03,722 [grpc-default-executor-8] WARN  
server.GrpcServerProtocolService (LogUtils.java:warn(123)) - s4: Failed 
INSTALL_SNAPSHOT request cid=0, isHeartbeat? false
   java.lang.IllegalStateException
        at org.apache.ratis.util.Preconditions.assertTrue(Preconditions.java:35)
        at 
org.apache.ratis.server.impl.ConfigurationManager.addConfiguration(ConfigurationManager.java:68)
        at 
org.apache.ratis.server.impl.ServerState.setRaftConf(ServerState.java:382)
        at 
org.apache.ratis.server.impl.ServerState.setRaftConf(ServerState.java:375)
        at 
org.apache.ratis.server.impl.SnapshotInstallationHandler.installSnapshotImpl(SnapshotInstallationHandler.java:138)
        at 
org.apache.ratis.server.impl.SnapshotInstallationHandler.installSnapshot(SnapshotInstallationHandler.java:97)
        at 
org.apache.ratis.server.impl.RaftServerImpl.installSnapshot(RaftServerImpl.java:1652)
        at 
org.apache.ratis.server.impl.RaftServerProxy.installSnapshot(RaftServerProxy.java:674)
        at 
org.apache.ratis.grpc.server.GrpcServerProtocolService$2.process(GrpcServerProtocolService.java:341)
        at 
org.apache.ratis.grpc.server.GrpcServerProtocolService$2.process(GrpcServerProtocolService.java:338)
        at 
org.apache.ratis.grpc.server.GrpcServerProtocolService$ServerRequestStreamObserver.process(GrpcServerProtocolService.java:106)
        at 
org.apache.ratis.grpc.server.GrpcServerProtocolService$ServerRequestStreamObserver.onNext(GrpcServerProtocolService.java:174)
        at 
org.apache.ratis.thirdparty.io.grpc.stub.ServerCalls$StreamingServerCallHandler$StreamingServerCallListener.onMessage(ServerCalls.java:262)
        at 
org.apache.ratis.thirdparty.io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.messagesAvailableInternal(ServerCallImpl.java:329)
        at 
org.apache.ratis.thirdparty.io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.messagesAvailable(ServerCallImpl.java:314)
        at 
org.apache.ratis.thirdparty.io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1MessagesAvailable.runInContext(ServerImpl.java:833)
        at 
org.apache.ratis.thirdparty.io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
        at 
org.apache.ratis.thirdparty.io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)
   ```
   
   The event sequence that triggers the exception is as follows (a little bit 
tricky):
   1. Original raft group [A,B,C] with A as the leader.
   2. Leader A receives setConfiguration [A,B,C,D,E] from client. This setConf 
request will form a configuration change log L0 (old=[A,B,C], new=[A,B,C,D,E]).
   3. Leader A replicates the snapshot and logs(up to L0, which is uncommitted) 
to D, E. 
   4. D and E receive L0 and they both append this configuration to the 
`configurationManager`. 
   5. B and C do not receive L0 so the L0 remains uncommitted.
   6. Unfortunately, leader A crashes, this client request setConfiguration 
also fails.
   7. B is elected as the new leader. The first thing B comes to power is to 
replicate a group configuration log L1 (old=[A,B,C], new=[A,B,C]) to establish 
its leader authority. NOTICE this log L1 shares the same index as L0 (the first 
uncommited index).
   8. Client retries the setConfiguration [A,B,C,D,E]. B tries to append log L1 
to D and E. Since D and E's `configurationManager` already contain a 
configuration at index L0, this new configuration L1 will be rejected and the 
`IllegalStateException` will be thrown.
   
   So back to the question, I think we should not allow two confs with the same 
**committed** index. For the uncommitted confs, we should force it to be 
aligned with the current leader (just as Raft Algo will truncates those 
conflicting local uncommitted logs). What do you think?
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to