[ 
https://issues.apache.org/jira/browse/RATIS-2203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17902266#comment-17902266
 ] 

Xinyu Tan commented on RATIS-2203:
----------------------------------

Leader:

{code:java}

2024-12-02 12:13:04,495 [18-server-thread1] INFO  o.a.r.s.i.RaftServerImpl:1296 
- 18@group-00020000002A: receive setConfiguration 
SetConfigurationRequest:client-A8BD159712B0->18@group-00020000002A, cid=1, 
seq=null, RW, null, SET_UNCONDITIONALLY, servers:[5|172.16.2.4:10750, 
18|172.16.2.18:10750], listeners:[] 
2024-12-02 12:13:04,497 [18-server-thread1] INFO  o.a.r.s.i.LeaderStateImpl:485 
- 18@group-00020000002A-LeaderStateImpl: startSetConfiguration 
SetConfigurationRequest:client-A8BD159712B0->18@group-00020000002A, cid=1, 
seq=null, RW, null, SET_UNCONDITIONALLY, servers:[5|172.16.2.4:10750, 
18|172.16.2.18:10750], listeners:[] 
2024-12-02 12:13:04,525 
[18@group-00020000002A->5-GrpcLogAppender-LogAppenderDaemon] INFO  
o.a.r.s.i.FollowerInfoImpl:154 - Follower 18@group-00020000002A->5 acknowledged 
installing snapshot 
2024-12-02 12:13:04,535 
[18@group-00020000002A->5-GrpcLogAppender-LogAppenderDaemon] INFO  
o.a.r.g.s.GrpcServerProtocolClient:65 - Build channel for 5|172.16.2.4:10750 
2024-12-02 12:13:14,753 [timer2] WARN  o.a.r.g.s.GrpcLogAppender:441 - 
18@group-00020000002A->5-GrpcLogAppender: Timed out appendEntries, 
errorCount=1, 
request=AppendEntriesRequest:cid=1,entriesCount=9455,entries=(t:1, i:0)...(t:1, 
i:9454) 
2024-12-02 12:13:15,211 [timer3] WARN  o.a.r.g.s.GrpcLogAppender:441 - 
18@group-00020000002A->5-GrpcLogAppender: Timed out HEARTBEAT appendEntries, 
errorCount=2, request=AppendEntriesRequest:cid=2,entriesCount=0 
2024-12-02 12:13:15,711 [timer4] WARN  o.a.r.g.s.GrpcLogAppender:441 - 
18@group-00020000002A->5-GrpcLogAppender: Timed out HEARTBEAT appendEntries, 
errorCount=3, request=AppendEntriesRequest:cid=3,entriesCount=0 
2024-12-02 12:13:16,214 [timer5] WARN  o.a.r.g.s.GrpcLogAppender:441 - 
18@group-00020000002A->5-GrpcLogAppender: Timed out HEARTBEAT appendEntries, 
errorCount=4, request=AppendEntriesRequest:cid=4,entriesCount=0 
2024-12-02 12:13:16,717 [timer6] WARN  o.a.r.g.s.GrpcLogAppender:441 - 
18@group-00020000002A->5-GrpcLogAppender: Timed out HEARTBEAT appendEntries, 
errorCount=5, request=AppendEntriesRequest:cid=5,entriesCount=0 
2024-12-02 12:13:17,219 [timer7] WARN  o.a.r.g.s.GrpcLogAppender:441 - 
18@group-00020000002A->5-GrpcLogAppender: Timed out HEARTBEAT appendEntries, 
errorCount=6, request=AppendEntriesRequest:cid=6,entriesCount=0 
2024-12-02 12:13:18,270 [timer0] WARN  o.a.r.g.s.GrpcLogAppender:441 - 
18@group-00020000002A->5-GrpcLogAppender: Timed out HEARTBEAT appendEntries, 
errorCount=7, request=AppendEntriesRequest:cid=8,entriesCount=0 
2024-12-02 12:13:18,479 [18-server-thread1] INFO  o.a.r.s.i.RaftServerImpl:1296 
- 18@group-00020000002A: receive setConfiguration 
SetConfigurationRequest:client-A8BD159712B0->18@group-00020000002A, cid=1, 
seq=null, RW, null, SET_UNCONDITIONALLY, servers:[5|172.16.2.4:10750, 
18|172.16.2.18:10750], listeners:[] 
2024-12-02 12:13:18,480 [grpc-default-executor-0] WARN  
o.a.r.g.s.GrpcAdminProtocolService:79 - Failed to setConfiguration: 
SetConfigurationRequest:client-A8BD159712B0->18@group-00020000002A, cid=1, 
seq=null, RW, null, SET_UNCONDITIONALLY, servers:[5|172.16.2.4:10750, 
18|172.16.2.18:10750], listeners:[] 
2024-12-02 12:13:18,772 [timer1] WARN  o.a.r.g.s.GrpcLogAppender:441 - 
18@group-00020000002A->5-GrpcLogAppender: Timed out HEARTBEAT appendEntries, 
errorCount=8, request=AppendEntriesRequest:cid=9,entriesCount=0 
2024-12-02 12:13:19,274 [timer2] WARN  o.a.r.g.s.GrpcLogAppender:441 - 
18@group-00020000002A->5-GrpcLogAppender: Timed out HEARTBEAT appendEntries, 
errorCount=9, request=AppendEntriesRequest:cid=10,entriesCount=0 
2024-12-02 12:13:19,774 [timer3] WARN  o.a.r.g.s.GrpcLogAppender:441 - 
18@group-00020000002A->5-GrpcLogAppender: Timed out HEARTBEAT appendEntries, 
errorCount=10, request=AppendEntriesRequest:cid=11,entriesCount=0 
2024-12-02 12:13:20,276 [timer4] WARN  o.a.r.g.s.GrpcLogAppender:441 - 
18@group-00020000002A->5-GrpcLogAppender: Timed out HEARTBEAT appendEntries, 
errorCount=11, request=AppendEntriesRequest:cid=12,entriesCount=0 
2024-12-02 12:13:20,777 [timer5] WARN  o.a.r.g.s.GrpcLogAppender:441 - 
18@group-00020000002A->5-GrpcLogAppender: Timed out HEARTBEAT appendEntries, 
errorCount=12, request=AppendEntriesRequest:cid=13,entriesCount=0 
2024-12-02 12:13:21,278 [timer6] WARN  o.a.r.g.s.GrpcLogAppender:441 - 
18@group-00020000002A->5-GrpcLogAppender: Timed out HEARTBEAT appendEntries, 
errorCount=13, request=AppendEntriesRequest:cid=14,entriesCount=0 
2024-12-02 12:13:21,780 [timer7] WARN  o.a.r.g.s.GrpcLogAppender:441 - 
18@group-00020000002A->5-GrpcLogAppender: Timed out HEARTBEAT appendEntries, 
errorCount=14, request=AppendEntriesRequest:cid=15,entriesCount=0 
2024-12-02 12:13:22,282 [timer0] WARN  o.a.r.g.s.GrpcLogAppender:441 - 
18@group-00020000002A->5-GrpcLogAppender: Timed out HEARTBEAT appendEntries, 
errorCount=15, request=AppendEntriesRequest:cid=16,entriesCount=0 
2024-12-02 12:13:22,528 [18-server-thread1] INFO  o.a.r.s.i.RaftServerImpl:1296 
- 18@group-00020000002A: receive setConfiguration 
SetConfigurationRequest:client-A8BD159712B0->18@group-00020000002A, cid=1, 
seq=null, RW, null, SET_UNCONDITIONALLY, servers:[5|172.16.2.4:10750, 
18|172.16.2.18:10750], listeners:[] 
2024-12-02 12:13:22,528 [grpc-default-executor-0] WARN  
o.a.r.g.s.GrpcAdminProtocolService:79 - Failed to setConfiguration: 
SetConfigurationRequest:client-A8BD159712B0->18@group-00020000002A, cid=1, 
seq=null, RW, null, SET_UNCONDITIONALLY, servers:[5|172.16.2.4:10750, 
18|172.16.2.18:10750], listeners:[] 
2024-12-02 12:13:22,629 
[18@group-00020000002A->5-GrpcLogAppender-LogAppenderDaemon] WARN  
o.a.r.g.s.GrpcLogAppender:293 - 18@group-00020000002A->5-GrpcLogAppender: Wait 
interrupted by java.lang.InterruptedException 
2024-12-02 12:13:22,629 [18@group-00020000002A-LeaderStateImpl] WARN  
o.a.r.g.s.GrpcAdminProtocolService:79 - Failed to setConfiguration: 
SetConfigurationRequest:client-A8BD159712B0->18@group-00020000002A, cid=1, 
seq=null, RW, null, SET_UNCONDITIONALLY, servers:[5|172.16.2.4:10750, 
18|172.16.2.18:10750], listeners:[] 
java.util.concurrent.CompletionException: 
org.apache.ratis.protocol.exceptions.ReconfigurationTimeoutException: 
18@group-00020000002A-ConfigurationStagingState: Fail to set configuration 
peers:[5|172.16.2.4:10750, 18|172.16.2.18:10750]|listeners:[] due to NOPROGRESS
Caused by: 
org.apache.ratis.protocol.exceptions.ReconfigurationTimeoutException: 
18@group-00020000002A-ConfigurationStagingState: Fail to set configuration 
peers:[5|172.16.2.4:10750, 18|172.16.2.18:10750]|listeners:[] due to NOPROGRESS
2024-12-02 12:13:22,784 [timer1] WARN  o.a.r.g.s.GrpcLogAppender:441 - 
18@group-00020000002A->5-GrpcLogAppender: Timed out HEARTBEAT appendEntries, 
errorCount=16, request=AppendEntriesRequest:cid=17,entriesCount=0 
2024-12-02 12:13:23,286 [timer2] WARN  o.a.r.g.s.GrpcLogAppender:441 - 
18@group-00020000002A->5-GrpcLogAppender: Timed out HEARTBEAT appendEntries, 
errorCount=17, request=AppendEntriesRequest:cid=18,entriesCount=0 
2024-12-02 12:13:23,787 [timer3] WARN  o.a.r.g.s.GrpcLogAppender:441 - 
18@group-00020000002A->5-GrpcLogAppender: Timed out HEARTBEAT appendEntries, 
errorCount=18, request=AppendEntriesRequest:cid=19,entriesCount=0 
2024-12-02 12:13:24,289 [timer4] WARN  o.a.r.g.s.GrpcLogAppender:441 - 
18@group-00020000002A->5-GrpcLogAppender: Timed out HEARTBEAT appendEntries, 
errorCount=19, request=AppendEntriesRequest:cid=20,entriesCount=0 
2024-12-02 12:13:24,793 [timer3] WARN  o.a.r.g.s.GrpcLogAppender:441 - 
18@group-00020000002A->5-GrpcLogAppender: Timed out HEARTBEAT appendEntries, 
errorCount=20, request=AppendEntriesRequest:cid=21,entriesCount=0 
2024-12-02 12:13:25,179 [grpc-default-executor-1] INFO  
o.a.r.g.s.GrpcLogAppender$AppendLogResponseHandler:550 - 
18@group-00020000002A->5-AppendLogResponseHandler: follower responses 
appendEntries COMPLETED 
2024-12-02 12:13:25,183 [grpc-default-executor-1] INFO  
o.a.r.s.i.FollowerInfoImpl:70 - 18@group-00020000002A->5: computeNextIndex 
nextIndex: updateUnconditionally 9455 -> 0 
2024-12-02 12:13:25,306 [grpc-default-executor-1] INFO  
o.a.r.g.s.GrpcLogAppender$AppendLogResponseHandler:550 - 
18@group-00020000002A->5-AppendLogResponseHandler: follower responses 
appendEntries COMPLETED 
2024-12-02 12:13:26,568 [18-server-thread1] INFO  o.a.r.s.i.RaftServerImpl:1296 
- 18@group-00020000002A: receive setConfiguration 
SetConfigurationRequest:client-A8BD159712B0->18@group-00020000002A, cid=1, 
seq=null, RW, null, SET_UNCONDITIONALLY, servers:[5|172.16.2.4:10750, 
18|172.16.2.18:10750], listeners:[] 
2024-12-02 12:13:26,568 [18-server-thread1] INFO  o.a.r.s.i.LeaderStateImpl:485 
- 18@group-00020000002A-LeaderStateImpl: startSetConfiguration 
SetConfigurationRequest:client-A8BD159712B0->18@group-00020000002A, cid=1, 
seq=null, RW, null, SET_UNCONDITIONALLY, servers:[5|172.16.2.4:10750, 
18|172.16.2.18:10750], listeners:[] 
2024-12-02 12:13:26,570 
[18@group-00020000002A->5-GrpcLogAppender-LogAppenderDaemon] INFO  
o.a.r.s.i.FollowerInfoImpl:154 - Follower 18@group-00020000002A->5 acknowledged 
installing snapshot 
2024-12-02 12:13:26,601 [grpc-default-executor-0] WARN  
o.a.r.g.s.GrpcLogAppender$AppendLogResponseHandler:519 - 
18@group-00020000002A->5-AppendLogResponseHandler: received INCONSISTENCY reply 
with nextIndex 9455, errorCount=1, 
request=AppendEntriesRequest:cid=1,entriesCount=9455,entries=(t:1, i:0)...(t:1, 
i:9454) 
{code}

new follower
{code:java}
2024-12-02 12:13:01,911 [pool-30-IoTDB-DataNodeInternalRPC-Processor-179] INFO  
o.a.r.s.i.RaftServerProxy$ImplMap:100 - 5: addNew group-00020000002A:[] returns 
group-00020000002A:java.util.concurrent.CompletableFuture@699d1462[Not 
completed] 
2024-12-02 12:13:01,911 [5-groupManagement] INFO  o.a.r.s.i.RaftServerImpl:255 
- 5: new RaftServerImpl for group-00020000002A:[] with 
ApplicationStateMachineProxy:uninitialized 
2024-12-02 12:13:01,912 [5-groupManagement] INFO  o.a.r.s.impl.ServerState:114 
- 5@group-00020000002A: ConfigurationManager, init=conf: {index: -1, 
cur=peers:[]|listeners:[], old=null}, confs=<EMPTY_MAP> 
2024-12-02 12:13:01,915 [5-groupManagement] INFO  
o.a.r.s.s.RaftStorageDirectoryImpl:138 - The storage directory 
/data1/iotdb/v_debug_sr_adding_1202/data/datanode/consensus/schema_region/47474747-4747-4747-4747-00020000002a
 does not exist. Creating ... 
2024-12-02 12:13:01,951 [5-groupManagement] INFO  
o.a.r.s.s.RaftStorageDirectoryImpl:229 - Lock on 
/data1/iotdb/v_debug_sr_adding_1202/data/datanode/consensus/schema_region/47474747-4747-4747-4747-00020000002a/in_use.lock
 acquired by nodename 21019@dn4 
2024-12-02 12:13:01,954 [5-groupManagement] INFO  o.a.r.s.s.RaftStorageImpl:98 
- Storage directory 
/data1/iotdb/v_debug_sr_adding_1202/data/datanode/consensus/schema_region/47474747-4747-4747-4747-00020000002a
 has been successfully formatted. 
2024-12-02 12:13:01,956 [5-groupManagement] INFO  o.a.ratis.util.AwaitToRun:81 
- Thread[5@group-00020000002A-cacheEviction-AwaitToRun,5,main] started 
2024-12-02 12:13:01,956 [5-groupManagement] INFO  
o.a.r.s.r.s.SegmentedRaftLogWorker:188 - new 
5@group-00020000002A-SegmentedRaftLogWorker for RaftStorageImpl:Storage 
Directory 
/data1/iotdb/v_debug_sr_adding_1202/data/datanode/consensus/schema_region/47474747-4747-4747-4747-00020000002a
 
2024-12-02 12:13:01,966 [5-groupManagement] INFO  
o.a.r.s.r.s.SegmentedRaftLogWorker:139 - 
5@group-00020000002A-SegmentedRaftLogWorker: flushIndex: setUnconditionally 0 
-> -1 
2024-12-02 12:13:01,966 [5-groupManagement] INFO  
o.a.r.s.r.s.SegmentedRaftLogWorker:139 - 
5@group-00020000002A-SegmentedRaftLogWorker: safeCacheEvictIndex: 
setUnconditionally 0 -> -1 
2024-12-02 12:13:01,966 [5-groupManagement] INFO  o.a.r.s.i.RaftServerImpl:397 
- 5@group-00020000002A: start with initializing state, conf=conf: {index: -1, 
cur=peers:[]|listeners:[], old=null} 
2024-12-02 12:13:01,966 [5-groupManagement] INFO  o.a.r.s.i.RaftServerImpl:378 
- 5@group-00020000002A: changes role from      null to FOLLOWER at term 0 for 
start 
2024-12-02 12:13:01,967 [5-groupManagement] INFO  o.a.r.util.JmxRegister:44 - 
Successfully registered JMX Bean with object name 
Ratis:service=RaftServer,group=group-00020000002A,id=5 
2024-12-02 12:13:01,967 [5-groupManagement] INFO  o.a.r.util.JmxRegister:58 - 
register mxBean class org.apache.ratis.server.impl.RaftServerJmxAdapter as 
Ratis:service=RaftServer,group=group-00020000002A,id=5 
2024-12-02 12:13:01,968 [5-groupManagement] INFO  o.a.r.s.i.RaftServerImpl:405 
- 5@group-00020000002A: Successfully started. 
2024-12-02 12:13:02,333 [5-server-thread1] INFO  o.a.r.s.i.RaftServerImpl:1296 
- 5@group-00020000002A: receive setConfiguration 
SetConfigurationRequest:client-A8BD159712B0->5@group-00020000002A, cid=1, 
seq=null, RW, null, SET_UNCONDITIONALLY, servers:[5|172.16.2.4:10750, 
18|172.16.2.18:10750], listeners:[] 
2024-12-02 12:13:02,334 [grpc-default-executor-20] WARN  
o.a.r.g.s.GrpcAdminProtocolService:79 - Failed to setConfiguration: 
SetConfigurationRequest:client-A8BD159712B0->5@group-00020000002A, cid=1, 
seq=null, RW, null, SET_UNCONDITIONALLY, servers:[5|172.16.2.4:10750, 
18|172.16.2.18:10750], listeners:[] 
java.util.concurrent.CompletionException: 
org.apache.ratis.protocol.exceptions.ServerNotReadyException: 
5@group-00020000002A is not in [RUNNING]: current state is STARTING
Caused by: org.apache.ratis.protocol.exceptions.ServerNotReadyException: 
5@group-00020000002A is not in [RUNNING]: current state is STARTING
2024-12-02 12:13:07,251 [5-server-thread1] INFO  o.a.r.s.i.RaftServerImpl:562 - 
5@group-00020000002A: set firstElectionSinceStartup to false for appendEntries 
2024-12-02 12:13:07,252 [5-server-thread1] INFO  o.a.r.s.impl.ServerState:272 - 
5@group-00020000002A: change Leader from null to 18 at term 1 for 
appendEntries, leader elected after 5338ms 
2024-12-02 12:13:07,252 [5-server-thread2] INFO  o.a.r.s.impl.ServerState:386 - 
5@group-00020000002A: set configuration conf: {index: 0, 
cur=peers:[18|172.16.2.18:10750]|listeners:[], old=null} 
2024-12-02 12:13:07,253 [5-server-thread2] INFO  
o.a.r.s.r.s.SegmentedRaftLogWorker:438 - 
5@group-00020000002A-SegmentedRaftLogWorker: Starting segment from index:0 
2024-12-02 12:13:07,254 [5@group-00020000002A-SegmentedRaftLogWorker] INFO  
o.a.r.s.r.s.BufferedWriteChannel:61 - open log_inprogress_0 at position 0 
2024-12-02 12:13:16,200 [5@group-00020000002A-SegmentedRaftLogWorker] INFO  
o.a.r.s.r.s.SegmentedRaftLogWorker$StartLogSegment:645 - 
5@group-00020000002A-SegmentedRaftLogWorker: created new log segment 
/data1/iotdb/v_debug_sr_adding_1202/data/datanode/consensus/schema_region/47474747-4747-4747-4747-00020000002a/current/log_inprogress_0
 
2024-12-02 12:13:16,463 [5-server-thread21] INFO  o.a.r.s.i.RaftServerImpl:1296 
- 5@group-00020000002A: receive setConfiguration 
SetConfigurationRequest:client-A8BD159712B0->5@group-00020000002A, cid=1, 
seq=null, RW, null, SET_UNCONDITIONALLY, servers:[5|172.16.2.4:10750, 
18|172.16.2.18:10750], listeners:[] 
2024-12-02 12:13:16,464 [grpc-default-executor-22] WARN  
o.a.r.g.s.GrpcAdminProtocolService:79 - Failed to setConfiguration: 
SetConfigurationRequest:client-A8BD159712B0->5@group-00020000002A, cid=1, 
seq=null, RW, null, SET_UNCONDITIONALLY, servers:[5|172.16.2.4:10750, 
18|172.16.2.18:10750], listeners:[] 
java.util.concurrent.CompletionException: 
org.apache.ratis.protocol.exceptions.ServerNotReadyException: 
5@group-00020000002A is not in [RUNNING]: current state is STARTING
Caused by: org.apache.ratis.protocol.exceptions.ServerNotReadyException: 
5@group-00020000002A is not in [RUNNING]: current state is STARTING
2024-12-02 12:13:20,513 [5-server-thread21] INFO  o.a.r.s.i.RaftServerImpl:1296 
- 5@group-00020000002A: receive setConfiguration 
SetConfigurationRequest:client-A8BD159712B0->5@group-00020000002A, cid=1, 
seq=null, RW, null, SET_UNCONDITIONALLY, servers:[5|172.16.2.4:10750, 
18|172.16.2.18:10750], listeners:[] 
2024-12-02 12:13:20,514 [grpc-default-executor-22] WARN  
o.a.r.g.s.GrpcAdminProtocolService:79 - Failed to setConfiguration: 
SetConfigurationRequest:client-A8BD159712B0->5@group-00020000002A, cid=1, 
seq=null, RW, null, SET_UNCONDITIONALLY, servers:[5|172.16.2.4:10750, 
18|172.16.2.18:10750], listeners:[] 
java.util.concurrent.CompletionException: 
org.apache.ratis.protocol.exceptions.ServerNotReadyException: 
5@group-00020000002A is not in [RUNNING]: current state is STARTING
Caused by: org.apache.ratis.protocol.exceptions.ServerNotReadyException: 
5@group-00020000002A is not in [RUNNING]: current state is STARTING
2024-12-02 12:13:24,553 [5-server-thread21] INFO  o.a.r.s.i.RaftServerImpl:1296 
- 5@group-00020000002A: receive setConfiguration 
SetConfigurationRequest:client-A8BD159712B0->5@group-00020000002A, cid=1, 
seq=null, RW, null, SET_UNCONDITIONALLY, servers:[5|172.16.2.4:10750, 
18|172.16.2.18:10750], listeners:[] 
2024-12-02 12:13:24,553 [grpc-default-executor-22] WARN  
o.a.r.g.s.GrpcAdminProtocolService:79 - Failed to setConfiguration: 
SetConfigurationRequest:client-A8BD159712B0->5@group-00020000002A, cid=1, 
seq=null, RW, null, SET_UNCONDITIONALLY, servers:[5|172.16.2.4:10750, 
18|172.16.2.18:10750], listeners:[] 
java.util.concurrent.CompletionException: 
org.apache.ratis.protocol.exceptions.ServerNotReadyException: 
5@group-00020000002A is not in [RUNNING]: current state is STARTING
Caused by: org.apache.ratis.protocol.exceptions.ServerNotReadyException: 
5@group-00020000002A is not in [RUNNING]: current state is STARTING
2024-12-02 12:13:26,597 [5-server-thread21] INFO  o.a.r.s.i.RaftServerImpl:1638 
- 5@group-00020000002A: Failed appendEntries as the first entry (index 0) 
already exists (snapshotIndex: -1, commitIndex: 9454) 
2024-12-02 12:13:26,597 [5-server-thread21] INFO  o.a.r.s.i.RaftServerImpl:1580 
- 5@group-00020000002A: appendEntries* reply 
18<-5#1:FAIL-t1,INCONSISTENCY,nextIndex=9455,followerCommit=9454,matchIndex=-1 
2024-12-02 12:13:38,575 [5-server-thread20] INFO  o.a.r.s.i.RaftServerImpl:1296 
- 5@group-00020000002A: receive setConfiguration 
SetConfigurationRequest:client-A8BD159712B0->5@group-00020000002A, cid=1, 
seq=null, RW, null, SET_UNCONDITIONALLY, servers:[5|172.16.2.4:10750, 
18|172.16.2.18:10750], listeners:[] 
{code}

It seems that the old leader did not have a snapshot and tried to synchronize 
the state via appendEntries. However, the initialization was quite slow, 
causing the process to exceed the 3 RPC timeout duration without returning any 
RPC responses, ultimately causing the migration task to fail. However, the 
client seems to keep retrying, and subsequent retries all fail.

There are two things we need to consider here:

- Why was the initial initialization so slow, causing the migration task to 
fail? Considering Java's GC, are there parameters we can adjust to control 
this? (Currently, IoTDB's default rpcTimeoutMax is set to 4 seconds for the 
maximum election timeout). Are there any optimizations on the kernel level that 
can be made to remove these magic number?

- If this migration task fails, are there any resources that were not properly 
released, preventing the second attempt from succeeding? Or should we delete 
and recreate the new peer in new node to continue with the migration?

[~szetszwo] What do you think?





> Fixed issue where Ratis member changes are stuck
> ------------------------------------------------
>
>                 Key: RATIS-2203
>                 URL: https://issues.apache.org/jira/browse/RATIS-2203
>             Project: Ratis
>          Issue Type: Improvement
>            Reporter: Xinyu Tan
>            Assignee: Xinyu Tan
>            Priority: Major
>
> Currently, in IoTDB, we use ratisclient to submit membership change requests, 
> and rely on the 
> [retrypolicy|https://github.com/apache/iotdb/blob/master/iotdb-core/consensus/src/main/java/org/apache/iotdb/consensus/ratis/RatisClient.java#L236-L249]
>  to ensure that membership change requests with longer durations can 
> eventually be successfully completed.
> In a recent test, we discovered that Ratis might get stuck during a 
> membership change. After a series of troubleshooting steps, we have mostly 
> determined that this is a bug in the Ratis core.
> Now, I will list the logs of the leader and the new replica to see if we can 
> analyze the cause of the error together.
> leader
> {code:java}
> 2024-11-29 11:50:49,166 [19-server-thread1] INFO  
> o.a.r.s.i.RaftServerImpl:1296 - 19@group-000200000061: receive 
> setConfiguration 
> SetConfigurationRequest:client-5FD616FE3614->19@group-000200000061, cid=1, 
> seq=null, RW, null, SET_UNCONDITIONALLY, servers:[4|172.16.2.3:10750, 
> 19|172.16.2.18:10750], listeners:[] 
> 2024-11-29 11:50:49,168 [19-server-thread1] INFO  
> o.a.r.s.i.LeaderStateImpl:485 - 19@group-000200000061-LeaderStateImpl: 
> startSetConfiguration 
> SetConfigurationRequest:client-5FD616FE3614->19@group-000200000061, cid=1, 
> seq=null, RW, null, SET_UNCONDITIONALLY, servers:[4|172.16.2.3:10750, 
> 19|172.16.2.18:10750], listeners:[] 
> 2024-11-29 11:50:49,208 
> [19@group-000200000061->4-GrpcLogAppender-LogAppenderDaemon] INFO  
> o.a.r.s.i.FollowerInfoImpl:154 - Follower 19@group-000200000061->4 
> acknowledged installing snapshot 
> 2024-11-29 11:50:49,215 
> [19@group-000200000061->4-GrpcLogAppender-LogAppenderDaemon] INFO  
> o.a.r.g.s.GrpcServerProtocolClient:65 - Build channel for 4|172.16.2.3:10750 
> 2024-11-29 11:50:59,412 [timer2] WARN  o.a.r.g.s.GrpcLogAppender:441 - 
> 19@group-000200000061->4-GrpcLogAppender: Timed out appendEntries, 
> errorCount=1, 
> request=AppendEntriesRequest:cid=1,entriesCount=9198,entries=(t:1, 
> i:0)...(t:1, i:9197) 
> 2024-11-29 11:50:59,732 [timer3] WARN  o.a.r.g.s.GrpcLogAppender:441 - 
> 19@group-000200000061->4-GrpcLogAppender: Timed out HEARTBEAT appendEntries, 
> errorCount=2, request=AppendEntriesRequest:cid=2,entriesCount=0 
> 2024-11-29 11:51:00,238 [timer4] WARN  o.a.r.g.s.GrpcLogAppender:441 - 
> 19@group-000200000061->4-GrpcLogAppender: Timed out HEARTBEAT appendEntries, 
> errorCount=3, request=AppendEntriesRequest:cid=3,entriesCount=0 
> 2024-11-29 11:51:00,740 [timer5] WARN  o.a.r.g.s.GrpcLogAppender:441 - 
> 19@group-000200000061->4-GrpcLogAppender: Timed out HEARTBEAT appendEntries, 
> errorCount=4, request=AppendEntriesRequest:cid=4,entriesCount=0 
> 2024-11-29 11:51:01,906 [timer6] WARN  o.a.r.g.s.GrpcLogAppender:441 - 
> 19@group-000200000061->4-GrpcLogAppender: Timed out HEARTBEAT appendEntries, 
> errorCount=5, request=AppendEntriesRequest:cid=6,entriesCount=0 
> 2024-11-29 11:51:02,406 [timer7] WARN  o.a.r.g.s.GrpcLogAppender:441 - 
> 19@group-000200000061->4-GrpcLogAppender: Timed out HEARTBEAT appendEntries, 
> errorCount=6, request=AppendEntriesRequest:cid=7,entriesCount=0 
> 2024-11-29 11:51:02,908 [timer0] WARN  o.a.r.g.s.GrpcLogAppender:441 - 
> 19@group-000200000061->4-GrpcLogAppender: Timed out HEARTBEAT appendEntries, 
> errorCount=7, request=AppendEntriesRequest:cid=8,entriesCount=0 
> 2024-11-29 11:51:03,086 [19-server-thread1] INFO  
> o.a.r.s.i.RaftServerImpl:1296 - 19@group-000200000061: receive 
> setConfiguration 
> SetConfigurationRequest:client-5FD616FE3614->19@group-000200000061, cid=1, 
> seq=null, RW, null, SET_UNCONDITIONALLY, servers:[4|172.16.2.3:10750, 
> 19|172.16.2.18:10750], listeners:[] 
> 2024-11-29 11:51:03,410 [timer1] WARN  o.a.r.g.s.GrpcLogAppender:441 - 
> 19@group-000200000061->4-GrpcLogAppender: Timed out HEARTBEAT appendEntries, 
> errorCount=8, request=AppendEntriesRequest:cid=9,entriesCount=0 
> 2024-11-29 11:51:03,912 [timer2] WARN  o.a.r.g.s.GrpcLogAppender:441 - 
> 19@group-000200000061->4-GrpcLogAppender: Timed out HEARTBEAT appendEntries, 
> errorCount=9, request=AppendEntriesRequest:cid=10,entriesCount=0 
> 2024-11-29 11:51:04,414 [timer3] WARN  o.a.r.g.s.GrpcLogAppender:441 - 
> 19@group-000200000061->4-GrpcLogAppender: Timed out HEARTBEAT appendEntries, 
> errorCount=10, request=AppendEntriesRequest:cid=11,entriesCount=0 
> 2024-11-29 11:51:04,914 [timer4] WARN  o.a.r.g.s.GrpcLogAppender:441 - 
> 19@group-000200000061->4-GrpcLogAppender: Timed out HEARTBEAT appendEntries, 
> errorCount=11, request=AppendEntriesRequest:cid=12,entriesCount=0 
> 2024-11-29 11:51:05,415 [timer5] WARN  o.a.r.g.s.GrpcLogAppender:441 - 
> 19@group-000200000061->4-GrpcLogAppender: Timed out HEARTBEAT appendEntries, 
> errorCount=12, request=AppendEntriesRequest:cid=13,entriesCount=0 
> 2024-11-29 11:51:05,916 [timer6] WARN  o.a.r.g.s.GrpcLogAppender:441 - 
> 19@group-000200000061->4-GrpcLogAppender: Timed out HEARTBEAT appendEntries, 
> errorCount=13, request=AppendEntriesRequest:cid=14,entriesCount=0 
> 2024-11-29 11:51:06,418 [timer7] WARN  o.a.r.g.s.GrpcLogAppender:441 - 
> 19@group-000200000061->4-GrpcLogAppender: Timed out HEARTBEAT appendEntries, 
> errorCount=14, request=AppendEntriesRequest:cid=15,entriesCount=0 
> 2024-11-29 11:51:06,683 
> [19@group-000200000061->4-GrpcLogAppender-LogAppenderDaemon] WARN  
> o.a.r.g.s.GrpcLogAppender:293 - 19@group-000200000061->4-GrpcLogAppender: 
> Wait interrupted by java.lang.InterruptedException 
> 2024-11-29 11:51:06,921 [timer0] WARN  o.a.r.g.s.GrpcLogAppender:441 - 
> 19@group-000200000061->4-GrpcLogAppender: Timed out HEARTBEAT appendEntries, 
> errorCount=15, request=AppendEntriesRequest:cid=16,entriesCount=0 
> 2024-11-29 11:51:07,134 [19-server-thread1] INFO  
> o.a.r.s.i.RaftServerImpl:1296 - 19@group-000200000061: receive 
> setConfiguration 
> SetConfigurationRequest:client-5FD616FE3614->19@group-000200000061, cid=1, 
> seq=null, RW, null, SET_UNCONDITIONALLY, servers:[4|172.16.2.3:10750, 
> 19|172.16.2.18:10750], listeners:[] 
> 2024-11-29 11:51:07,135 [19-server-thread1] INFO  
> o.a.r.s.i.LeaderStateImpl:485 - 19@group-000200000061-LeaderStateImpl: 
> startSetConfiguration 
> SetConfigurationRequest:client-5FD616FE3614->19@group-000200000061, cid=1, 
> seq=null, RW, null, SET_UNCONDITIONALLY, servers:[4|172.16.2.3:10750, 
> 19|172.16.2.18:10750], listeners:[] 
> 2024-11-29 11:51:07,137 
> [19@group-000200000061->4-GrpcLogAppender-LogAppenderDaemon] INFO  
> o.a.r.s.i.FollowerInfoImpl:154 - Follower 19@group-000200000061->4 
> acknowledged installing snapshot 
> 2024-11-29 11:51:07,421 [timer1] WARN  o.a.r.g.s.GrpcLogAppender:441 - 
> 19@group-000200000061->4-GrpcLogAppender: Timed out HEARTBEAT appendEntries, 
> errorCount=16, request=AppendEntriesRequest:cid=17,entriesCount=0 
> 2024-11-29 11:51:07,924 [timer2] WARN  o.a.r.g.s.GrpcLogAppender:441 - 
> 19@group-000200000061->4-GrpcLogAppender: Timed out HEARTBEAT appendEntries, 
> errorCount=17, request=AppendEntriesRequest:cid=18,entriesCount=0 
> 2024-11-29 11:51:08,426 [timer3] WARN  o.a.r.g.s.GrpcLogAppender:441 - 
> 19@group-000200000061->4-GrpcLogAppender: Timed out HEARTBEAT appendEntries, 
> errorCount=18, request=AppendEntriesRequest:cid=19,entriesCount=0 
> 2024-11-29 11:51:08,950 [grpc-default-executor-5] WARN  
> o.a.r.g.s.GrpcLogAppender$AppendLogResponseHandler:519 - 
> 19@group-000200000061->4-AppendLogResponseHandler: received INCONSISTENCY 
> reply with nextIndex 1, errorCount=1, 
> request=AppendEntriesRequest:cid=1,entriesCount=9198,entries=(t:1, 
> i:0)...(t:1, i:9197) 
> 2024-11-29 11:51:08,951 [grpc-default-executor-5] INFO  
> o.a.r.s.i.FollowerInfoImpl:70 - 19@group-000200000061->4: setNextIndex 
> nextIndex: updateUnconditionally 9198 -> 1 
> 2024-11-29 11:51:08,972 [grpc-default-executor-5] INFO  
> o.a.r.g.s.GrpcLogAppender$AppendLogResponseHandler:550 - 
> 19@group-000200000061->4-AppendLogResponseHandler: follower responses 
> appendEntries COMPLETED 
> 2024-11-29 11:51:08,975 [grpc-default-executor-5] INFO  
> o.a.r.s.i.FollowerInfoImpl:70 - 19@group-000200000061->4: computeNextIndex 
> nextIndex: updateUnconditionally 9198 -> 0 
> 2024-11-29 11:51:09,015 [grpc-default-executor-5] WARN  
> o.a.r.g.s.GrpcLogAppender$AppendLogResponseHandler:519 - 
> 19@group-000200000061->4-AppendLogResponseHandler: received INCONSISTENCY 
> reply with nextIndex 9198, errorCount=1, 
> request=AppendEntriesRequest:cid=6,entriesCount=9197,entries=(t:1, 
> i:1)...(t:1, i:9197) 
> 2024-11-29 11:51:09,022 [grpc-default-executor-1] INFO  
> o.a.r.g.s.GrpcLogAppender$AppendLogResponseHandler:550 - 
> 19@group-000200000061->4-AppendLogResponseHandler: follower responses 
> appendEntries COMPLETED 
> {code}
> new adding follower
> {code:java}
> 2024-11-29 11:50:46,528 [4-groupManagement] INFO  
> o.a.r.s.i.RaftServerImpl:405 - 4@group-000200000061: Successfully started. 
> 2024-11-29 11:50:46,915 [4-server-thread1] INFO  
> o.a.r.s.i.RaftServerImpl:1296 - 4@group-000200000061: receive 
> setConfiguration 
> SetConfigurationRequest:client-5FD616FE3614->4@group-000200000061, cid=1, 
> seq=null, RW, null, SET_UNCONDITIONALLY, servers:[4|172.16.2.3:10750, 
> 19|172.16.2.18:10750], listeners:[] 
> 2024-11-29 11:50:50,886 [4-server-thread1] INFO  o.a.r.s.i.RaftServerImpl:562 
> - 4@group-000200000061: set firstElectionSinceStartup to false for 
> appendEntries 
> 2024-11-29 11:50:50,886 [4-server-thread1] INFO  
> o.a.i.d.c.s.s.SchemaRegionStateMachine:75 - Current node [nodeId: 4] is no 
> longer the schema region leader [regionId: SchemaRegion[97]], the new leader 
> is [nodeId:19] 
> 2024-11-29 11:50:50,886 [4-server-thread1] INFO  o.a.r.s.impl.ServerState:272 
> - 4@group-000200000061: change Leader from null to 19 at term 1 for 
> appendEntries, leader elected after 4575ms 
> 2024-11-29 11:50:50,887 [4-server-thread2] INFO  o.a.r.s.impl.ServerState:386 
> - 4@group-000200000061: set configuration conf: {index: 0, 
> cur=peers:[19|172.16.2.18:10750]|listeners:[], old=null} 
> 2024-11-29 11:50:50,887 [4-server-thread2] INFO  
> o.a.r.s.r.s.SegmentedRaftLogWorker:438 - 
> 4@group-000200000061-SegmentedRaftLogWorker: Starting segment from index:0 
> 2024-11-29 11:50:50,888 [4@group-000200000061-SegmentedRaftLogWorker] INFO  
> o.a.r.s.r.s.BufferedWriteChannel:61 - open log_inprogress_0 at position 0 
> 2024-11-29 11:50:56,801 [4@group-000200000061-SegmentedRaftLogWorker] INFO  
> o.a.r.s.r.s.SegmentedRaftLogWorker$StartLogSegment:645 - 
> 4@group-000200000061-SegmentedRaftLogWorker: created new log segment 
> /data1/iotdb/tablev2_rc5_1128_1fc99a6_1rep/data/datanode/consensus/schema_region/47474747-4747-4747-4747-000200000061/current/log_inprogress_0
>  
> 2024-11-29 11:51:01,063 [4-server-thread21] INFO  
> o.a.r.s.i.RaftServerImpl:1296 - 4@group-000200000061: receive 
> setConfiguration 
> SetConfigurationRequest:client-5FD616FE3614->4@group-000200000061, cid=1, 
> seq=null, RW, null, SET_UNCONDITIONALLY, servers:[4|172.16.2.3:10750, 
> 19|172.16.2.18:10750], listeners:[] 
> 2024-11-29 11:51:05,118 [4-server-thread21] INFO  
> o.a.r.s.i.RaftServerImpl:1296 - 4@group-000200000061: receive 
> setConfiguration 
> SetConfigurationRequest:client-5FD616FE3614->4@group-000200000061, cid=1, 
> seq=null, RW, null, SET_UNCONDITIONALLY, servers:[4|172.16.2.3:10750, 
> 19|172.16.2.18:10750], listeners:[] 
> 2024-11-29 11:51:08,942 [4-server-thread22] INFO  
> o.a.r.s.i.RaftServerImpl:1638 - 4@group-000200000061: Failed appendEntries as 
> the first entry (index 0) already exists (snapshotIndex: -1, commitIndex: 0) 
> 2024-11-29 11:51:08,943 [4-server-thread22] INFO  
> o.a.r.s.i.RaftServerImpl:1580 - 4@group-000200000061: appendEntries* reply 
> 19<-4#1:FAIL-t1,INCONSISTENCY,nextIndex=1,followerCommit=-1,matchIndex=-1 
> 2024-11-29 11:51:09,003 [4-server-thread2] INFO  
> o.a.r.s.i.RaftServerImpl:1638 - 4@group-000200000061: Failed appendEntries as 
> the first entry (index 1) already exists (snapshotIndex: -1, commitIndex: 
> 9197) 
> 2024-11-29 11:51:09,004 [4-server-thread2] INFO  
> o.a.r.s.i.RaftServerImpl:1580 - 4@group-000200000061: appendEntries* reply 
> 19<-4#6:FAIL-t1,INCONSISTENCY,nextIndex=9198,followerCommit=9197,matchIndex=-1
>  
> {code}
> From the logs, we can see that the synchronization link from the leader to 
> the new follower has been closed, so the logs cannot be committed. This 
> causes the membership change thread to be stuck indefinitely while waiting 
> for the follower to catch up.
> Based on the logs above, I have a few suspicions:
> - Why is the gRPC stream API on the leader side being completed? I haven't 
> seen any place on the follower side where this function is called.
> - The commitIndex obtained in the checkInconsistentAppendEntries function on 
> the follower side seems inconsistent with the one obtained in 
> appendEntriesAsync. For example, in the following two lines of logs, the same 
> thread gets a commitIndex of -1 outside the lock, and an index of 0 inside 
> the lock. This causes the subsequent synchronization to be completely stuck. 
> I'm not sure if this is a concurrency issue, but perhaps we should move this 
> line inside the 
> [lock|ttps://github.com/apache/ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L1591]?
> {code:java}
> 2024-11-29 11:51:08,942 [4-server-thread22] INFO  
> o.a.r.s.i.RaftServerImpl:1638 - 4@group-000200000061: Failed appendEntries as 
> the first entry (index 0) already exists (snapshotIndex: -1, commitIndex: 0) 
> 2024-11-29 11:51:08,943 [4-server-thread22] INFO  
> o.a.r.s.i.RaftServerImpl:1580 - 4@group-000200000061: appendEntries* reply 
> 19<-4#1:FAIL-t1,INCONSISTENCY,nextIndex=1,followerCommit=-1,matchIndex=-1 
> {code}
> What's your opinion? [~szetszwo]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to