[
https://issues.apache.org/jira/browse/RATIS-2203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17902266#comment-17902266
]
Xinyu Tan commented on RATIS-2203:
----------------------------------
Leader:
{code:java}
2024-12-02 12:13:04,495 [18-server-thread1] INFO o.a.r.s.i.RaftServerImpl:1296
- 18@group-00020000002A: receive setConfiguration
SetConfigurationRequest:client-A8BD159712B0->18@group-00020000002A, cid=1,
seq=null, RW, null, SET_UNCONDITIONALLY, servers:[5|172.16.2.4:10750,
18|172.16.2.18:10750], listeners:[]
2024-12-02 12:13:04,497 [18-server-thread1] INFO o.a.r.s.i.LeaderStateImpl:485
- 18@group-00020000002A-LeaderStateImpl: startSetConfiguration
SetConfigurationRequest:client-A8BD159712B0->18@group-00020000002A, cid=1,
seq=null, RW, null, SET_UNCONDITIONALLY, servers:[5|172.16.2.4:10750,
18|172.16.2.18:10750], listeners:[]
2024-12-02 12:13:04,525
[18@group-00020000002A->5-GrpcLogAppender-LogAppenderDaemon] INFO
o.a.r.s.i.FollowerInfoImpl:154 - Follower 18@group-00020000002A->5 acknowledged
installing snapshot
2024-12-02 12:13:04,535
[18@group-00020000002A->5-GrpcLogAppender-LogAppenderDaemon] INFO
o.a.r.g.s.GrpcServerProtocolClient:65 - Build channel for 5|172.16.2.4:10750
2024-12-02 12:13:14,753 [timer2] WARN o.a.r.g.s.GrpcLogAppender:441 -
18@group-00020000002A->5-GrpcLogAppender: Timed out appendEntries,
errorCount=1,
request=AppendEntriesRequest:cid=1,entriesCount=9455,entries=(t:1, i:0)...(t:1,
i:9454)
2024-12-02 12:13:15,211 [timer3] WARN o.a.r.g.s.GrpcLogAppender:441 -
18@group-00020000002A->5-GrpcLogAppender: Timed out HEARTBEAT appendEntries,
errorCount=2, request=AppendEntriesRequest:cid=2,entriesCount=0
2024-12-02 12:13:15,711 [timer4] WARN o.a.r.g.s.GrpcLogAppender:441 -
18@group-00020000002A->5-GrpcLogAppender: Timed out HEARTBEAT appendEntries,
errorCount=3, request=AppendEntriesRequest:cid=3,entriesCount=0
2024-12-02 12:13:16,214 [timer5] WARN o.a.r.g.s.GrpcLogAppender:441 -
18@group-00020000002A->5-GrpcLogAppender: Timed out HEARTBEAT appendEntries,
errorCount=4, request=AppendEntriesRequest:cid=4,entriesCount=0
2024-12-02 12:13:16,717 [timer6] WARN o.a.r.g.s.GrpcLogAppender:441 -
18@group-00020000002A->5-GrpcLogAppender: Timed out HEARTBEAT appendEntries,
errorCount=5, request=AppendEntriesRequest:cid=5,entriesCount=0
2024-12-02 12:13:17,219 [timer7] WARN o.a.r.g.s.GrpcLogAppender:441 -
18@group-00020000002A->5-GrpcLogAppender: Timed out HEARTBEAT appendEntries,
errorCount=6, request=AppendEntriesRequest:cid=6,entriesCount=0
2024-12-02 12:13:18,270 [timer0] WARN o.a.r.g.s.GrpcLogAppender:441 -
18@group-00020000002A->5-GrpcLogAppender: Timed out HEARTBEAT appendEntries,
errorCount=7, request=AppendEntriesRequest:cid=8,entriesCount=0
2024-12-02 12:13:18,479 [18-server-thread1] INFO o.a.r.s.i.RaftServerImpl:1296
- 18@group-00020000002A: receive setConfiguration
SetConfigurationRequest:client-A8BD159712B0->18@group-00020000002A, cid=1,
seq=null, RW, null, SET_UNCONDITIONALLY, servers:[5|172.16.2.4:10750,
18|172.16.2.18:10750], listeners:[]
2024-12-02 12:13:18,480 [grpc-default-executor-0] WARN
o.a.r.g.s.GrpcAdminProtocolService:79 - Failed to setConfiguration:
SetConfigurationRequest:client-A8BD159712B0->18@group-00020000002A, cid=1,
seq=null, RW, null, SET_UNCONDITIONALLY, servers:[5|172.16.2.4:10750,
18|172.16.2.18:10750], listeners:[]
2024-12-02 12:13:18,772 [timer1] WARN o.a.r.g.s.GrpcLogAppender:441 -
18@group-00020000002A->5-GrpcLogAppender: Timed out HEARTBEAT appendEntries,
errorCount=8, request=AppendEntriesRequest:cid=9,entriesCount=0
2024-12-02 12:13:19,274 [timer2] WARN o.a.r.g.s.GrpcLogAppender:441 -
18@group-00020000002A->5-GrpcLogAppender: Timed out HEARTBEAT appendEntries,
errorCount=9, request=AppendEntriesRequest:cid=10,entriesCount=0
2024-12-02 12:13:19,774 [timer3] WARN o.a.r.g.s.GrpcLogAppender:441 -
18@group-00020000002A->5-GrpcLogAppender: Timed out HEARTBEAT appendEntries,
errorCount=10, request=AppendEntriesRequest:cid=11,entriesCount=0
2024-12-02 12:13:20,276 [timer4] WARN o.a.r.g.s.GrpcLogAppender:441 -
18@group-00020000002A->5-GrpcLogAppender: Timed out HEARTBEAT appendEntries,
errorCount=11, request=AppendEntriesRequest:cid=12,entriesCount=0
2024-12-02 12:13:20,777 [timer5] WARN o.a.r.g.s.GrpcLogAppender:441 -
18@group-00020000002A->5-GrpcLogAppender: Timed out HEARTBEAT appendEntries,
errorCount=12, request=AppendEntriesRequest:cid=13,entriesCount=0
2024-12-02 12:13:21,278 [timer6] WARN o.a.r.g.s.GrpcLogAppender:441 -
18@group-00020000002A->5-GrpcLogAppender: Timed out HEARTBEAT appendEntries,
errorCount=13, request=AppendEntriesRequest:cid=14,entriesCount=0
2024-12-02 12:13:21,780 [timer7] WARN o.a.r.g.s.GrpcLogAppender:441 -
18@group-00020000002A->5-GrpcLogAppender: Timed out HEARTBEAT appendEntries,
errorCount=14, request=AppendEntriesRequest:cid=15,entriesCount=0
2024-12-02 12:13:22,282 [timer0] WARN o.a.r.g.s.GrpcLogAppender:441 -
18@group-00020000002A->5-GrpcLogAppender: Timed out HEARTBEAT appendEntries,
errorCount=15, request=AppendEntriesRequest:cid=16,entriesCount=0
2024-12-02 12:13:22,528 [18-server-thread1] INFO o.a.r.s.i.RaftServerImpl:1296
- 18@group-00020000002A: receive setConfiguration
SetConfigurationRequest:client-A8BD159712B0->18@group-00020000002A, cid=1,
seq=null, RW, null, SET_UNCONDITIONALLY, servers:[5|172.16.2.4:10750,
18|172.16.2.18:10750], listeners:[]
2024-12-02 12:13:22,528 [grpc-default-executor-0] WARN
o.a.r.g.s.GrpcAdminProtocolService:79 - Failed to setConfiguration:
SetConfigurationRequest:client-A8BD159712B0->18@group-00020000002A, cid=1,
seq=null, RW, null, SET_UNCONDITIONALLY, servers:[5|172.16.2.4:10750,
18|172.16.2.18:10750], listeners:[]
2024-12-02 12:13:22,629
[18@group-00020000002A->5-GrpcLogAppender-LogAppenderDaemon] WARN
o.a.r.g.s.GrpcLogAppender:293 - 18@group-00020000002A->5-GrpcLogAppender: Wait
interrupted by java.lang.InterruptedException
2024-12-02 12:13:22,629 [18@group-00020000002A-LeaderStateImpl] WARN
o.a.r.g.s.GrpcAdminProtocolService:79 - Failed to setConfiguration:
SetConfigurationRequest:client-A8BD159712B0->18@group-00020000002A, cid=1,
seq=null, RW, null, SET_UNCONDITIONALLY, servers:[5|172.16.2.4:10750,
18|172.16.2.18:10750], listeners:[]
java.util.concurrent.CompletionException:
org.apache.ratis.protocol.exceptions.ReconfigurationTimeoutException:
18@group-00020000002A-ConfigurationStagingState: Fail to set configuration
peers:[5|172.16.2.4:10750, 18|172.16.2.18:10750]|listeners:[] due to NOPROGRESS
Caused by:
org.apache.ratis.protocol.exceptions.ReconfigurationTimeoutException:
18@group-00020000002A-ConfigurationStagingState: Fail to set configuration
peers:[5|172.16.2.4:10750, 18|172.16.2.18:10750]|listeners:[] due to NOPROGRESS
2024-12-02 12:13:22,784 [timer1] WARN o.a.r.g.s.GrpcLogAppender:441 -
18@group-00020000002A->5-GrpcLogAppender: Timed out HEARTBEAT appendEntries,
errorCount=16, request=AppendEntriesRequest:cid=17,entriesCount=0
2024-12-02 12:13:23,286 [timer2] WARN o.a.r.g.s.GrpcLogAppender:441 -
18@group-00020000002A->5-GrpcLogAppender: Timed out HEARTBEAT appendEntries,
errorCount=17, request=AppendEntriesRequest:cid=18,entriesCount=0
2024-12-02 12:13:23,787 [timer3] WARN o.a.r.g.s.GrpcLogAppender:441 -
18@group-00020000002A->5-GrpcLogAppender: Timed out HEARTBEAT appendEntries,
errorCount=18, request=AppendEntriesRequest:cid=19,entriesCount=0
2024-12-02 12:13:24,289 [timer4] WARN o.a.r.g.s.GrpcLogAppender:441 -
18@group-00020000002A->5-GrpcLogAppender: Timed out HEARTBEAT appendEntries,
errorCount=19, request=AppendEntriesRequest:cid=20,entriesCount=0
2024-12-02 12:13:24,793 [timer3] WARN o.a.r.g.s.GrpcLogAppender:441 -
18@group-00020000002A->5-GrpcLogAppender: Timed out HEARTBEAT appendEntries,
errorCount=20, request=AppendEntriesRequest:cid=21,entriesCount=0
2024-12-02 12:13:25,179 [grpc-default-executor-1] INFO
o.a.r.g.s.GrpcLogAppender$AppendLogResponseHandler:550 -
18@group-00020000002A->5-AppendLogResponseHandler: follower responses
appendEntries COMPLETED
2024-12-02 12:13:25,183 [grpc-default-executor-1] INFO
o.a.r.s.i.FollowerInfoImpl:70 - 18@group-00020000002A->5: computeNextIndex
nextIndex: updateUnconditionally 9455 -> 0
2024-12-02 12:13:25,306 [grpc-default-executor-1] INFO
o.a.r.g.s.GrpcLogAppender$AppendLogResponseHandler:550 -
18@group-00020000002A->5-AppendLogResponseHandler: follower responses
appendEntries COMPLETED
2024-12-02 12:13:26,568 [18-server-thread1] INFO o.a.r.s.i.RaftServerImpl:1296
- 18@group-00020000002A: receive setConfiguration
SetConfigurationRequest:client-A8BD159712B0->18@group-00020000002A, cid=1,
seq=null, RW, null, SET_UNCONDITIONALLY, servers:[5|172.16.2.4:10750,
18|172.16.2.18:10750], listeners:[]
2024-12-02 12:13:26,568 [18-server-thread1] INFO o.a.r.s.i.LeaderStateImpl:485
- 18@group-00020000002A-LeaderStateImpl: startSetConfiguration
SetConfigurationRequest:client-A8BD159712B0->18@group-00020000002A, cid=1,
seq=null, RW, null, SET_UNCONDITIONALLY, servers:[5|172.16.2.4:10750,
18|172.16.2.18:10750], listeners:[]
2024-12-02 12:13:26,570
[18@group-00020000002A->5-GrpcLogAppender-LogAppenderDaemon] INFO
o.a.r.s.i.FollowerInfoImpl:154 - Follower 18@group-00020000002A->5 acknowledged
installing snapshot
2024-12-02 12:13:26,601 [grpc-default-executor-0] WARN
o.a.r.g.s.GrpcLogAppender$AppendLogResponseHandler:519 -
18@group-00020000002A->5-AppendLogResponseHandler: received INCONSISTENCY reply
with nextIndex 9455, errorCount=1,
request=AppendEntriesRequest:cid=1,entriesCount=9455,entries=(t:1, i:0)...(t:1,
i:9454)
{code}
new follower
{code:java}
2024-12-02 12:13:01,911 [pool-30-IoTDB-DataNodeInternalRPC-Processor-179] INFO
o.a.r.s.i.RaftServerProxy$ImplMap:100 - 5: addNew group-00020000002A:[] returns
group-00020000002A:java.util.concurrent.CompletableFuture@699d1462[Not
completed]
2024-12-02 12:13:01,911 [5-groupManagement] INFO o.a.r.s.i.RaftServerImpl:255
- 5: new RaftServerImpl for group-00020000002A:[] with
ApplicationStateMachineProxy:uninitialized
2024-12-02 12:13:01,912 [5-groupManagement] INFO o.a.r.s.impl.ServerState:114
- 5@group-00020000002A: ConfigurationManager, init=conf: {index: -1,
cur=peers:[]|listeners:[], old=null}, confs=<EMPTY_MAP>
2024-12-02 12:13:01,915 [5-groupManagement] INFO
o.a.r.s.s.RaftStorageDirectoryImpl:138 - The storage directory
/data1/iotdb/v_debug_sr_adding_1202/data/datanode/consensus/schema_region/47474747-4747-4747-4747-00020000002a
does not exist. Creating ...
2024-12-02 12:13:01,951 [5-groupManagement] INFO
o.a.r.s.s.RaftStorageDirectoryImpl:229 - Lock on
/data1/iotdb/v_debug_sr_adding_1202/data/datanode/consensus/schema_region/47474747-4747-4747-4747-00020000002a/in_use.lock
acquired by nodename 21019@dn4
2024-12-02 12:13:01,954 [5-groupManagement] INFO o.a.r.s.s.RaftStorageImpl:98
- Storage directory
/data1/iotdb/v_debug_sr_adding_1202/data/datanode/consensus/schema_region/47474747-4747-4747-4747-00020000002a
has been successfully formatted.
2024-12-02 12:13:01,956 [5-groupManagement] INFO o.a.ratis.util.AwaitToRun:81
- Thread[5@group-00020000002A-cacheEviction-AwaitToRun,5,main] started
2024-12-02 12:13:01,956 [5-groupManagement] INFO
o.a.r.s.r.s.SegmentedRaftLogWorker:188 - new
5@group-00020000002A-SegmentedRaftLogWorker for RaftStorageImpl:Storage
Directory
/data1/iotdb/v_debug_sr_adding_1202/data/datanode/consensus/schema_region/47474747-4747-4747-4747-00020000002a
2024-12-02 12:13:01,966 [5-groupManagement] INFO
o.a.r.s.r.s.SegmentedRaftLogWorker:139 -
5@group-00020000002A-SegmentedRaftLogWorker: flushIndex: setUnconditionally 0
-> -1
2024-12-02 12:13:01,966 [5-groupManagement] INFO
o.a.r.s.r.s.SegmentedRaftLogWorker:139 -
5@group-00020000002A-SegmentedRaftLogWorker: safeCacheEvictIndex:
setUnconditionally 0 -> -1
2024-12-02 12:13:01,966 [5-groupManagement] INFO o.a.r.s.i.RaftServerImpl:397
- 5@group-00020000002A: start with initializing state, conf=conf: {index: -1,
cur=peers:[]|listeners:[], old=null}
2024-12-02 12:13:01,966 [5-groupManagement] INFO o.a.r.s.i.RaftServerImpl:378
- 5@group-00020000002A: changes role from null to FOLLOWER at term 0 for
start
2024-12-02 12:13:01,967 [5-groupManagement] INFO o.a.r.util.JmxRegister:44 -
Successfully registered JMX Bean with object name
Ratis:service=RaftServer,group=group-00020000002A,id=5
2024-12-02 12:13:01,967 [5-groupManagement] INFO o.a.r.util.JmxRegister:58 -
register mxBean class org.apache.ratis.server.impl.RaftServerJmxAdapter as
Ratis:service=RaftServer,group=group-00020000002A,id=5
2024-12-02 12:13:01,968 [5-groupManagement] INFO o.a.r.s.i.RaftServerImpl:405
- 5@group-00020000002A: Successfully started.
2024-12-02 12:13:02,333 [5-server-thread1] INFO o.a.r.s.i.RaftServerImpl:1296
- 5@group-00020000002A: receive setConfiguration
SetConfigurationRequest:client-A8BD159712B0->5@group-00020000002A, cid=1,
seq=null, RW, null, SET_UNCONDITIONALLY, servers:[5|172.16.2.4:10750,
18|172.16.2.18:10750], listeners:[]
2024-12-02 12:13:02,334 [grpc-default-executor-20] WARN
o.a.r.g.s.GrpcAdminProtocolService:79 - Failed to setConfiguration:
SetConfigurationRequest:client-A8BD159712B0->5@group-00020000002A, cid=1,
seq=null, RW, null, SET_UNCONDITIONALLY, servers:[5|172.16.2.4:10750,
18|172.16.2.18:10750], listeners:[]
java.util.concurrent.CompletionException:
org.apache.ratis.protocol.exceptions.ServerNotReadyException:
5@group-00020000002A is not in [RUNNING]: current state is STARTING
Caused by: org.apache.ratis.protocol.exceptions.ServerNotReadyException:
5@group-00020000002A is not in [RUNNING]: current state is STARTING
2024-12-02 12:13:07,251 [5-server-thread1] INFO o.a.r.s.i.RaftServerImpl:562 -
5@group-00020000002A: set firstElectionSinceStartup to false for appendEntries
2024-12-02 12:13:07,252 [5-server-thread1] INFO o.a.r.s.impl.ServerState:272 -
5@group-00020000002A: change Leader from null to 18 at term 1 for
appendEntries, leader elected after 5338ms
2024-12-02 12:13:07,252 [5-server-thread2] INFO o.a.r.s.impl.ServerState:386 -
5@group-00020000002A: set configuration conf: {index: 0,
cur=peers:[18|172.16.2.18:10750]|listeners:[], old=null}
2024-12-02 12:13:07,253 [5-server-thread2] INFO
o.a.r.s.r.s.SegmentedRaftLogWorker:438 -
5@group-00020000002A-SegmentedRaftLogWorker: Starting segment from index:0
2024-12-02 12:13:07,254 [5@group-00020000002A-SegmentedRaftLogWorker] INFO
o.a.r.s.r.s.BufferedWriteChannel:61 - open log_inprogress_0 at position 0
2024-12-02 12:13:16,200 [5@group-00020000002A-SegmentedRaftLogWorker] INFO
o.a.r.s.r.s.SegmentedRaftLogWorker$StartLogSegment:645 -
5@group-00020000002A-SegmentedRaftLogWorker: created new log segment
/data1/iotdb/v_debug_sr_adding_1202/data/datanode/consensus/schema_region/47474747-4747-4747-4747-00020000002a/current/log_inprogress_0
2024-12-02 12:13:16,463 [5-server-thread21] INFO o.a.r.s.i.RaftServerImpl:1296
- 5@group-00020000002A: receive setConfiguration
SetConfigurationRequest:client-A8BD159712B0->5@group-00020000002A, cid=1,
seq=null, RW, null, SET_UNCONDITIONALLY, servers:[5|172.16.2.4:10750,
18|172.16.2.18:10750], listeners:[]
2024-12-02 12:13:16,464 [grpc-default-executor-22] WARN
o.a.r.g.s.GrpcAdminProtocolService:79 - Failed to setConfiguration:
SetConfigurationRequest:client-A8BD159712B0->5@group-00020000002A, cid=1,
seq=null, RW, null, SET_UNCONDITIONALLY, servers:[5|172.16.2.4:10750,
18|172.16.2.18:10750], listeners:[]
java.util.concurrent.CompletionException:
org.apache.ratis.protocol.exceptions.ServerNotReadyException:
5@group-00020000002A is not in [RUNNING]: current state is STARTING
Caused by: org.apache.ratis.protocol.exceptions.ServerNotReadyException:
5@group-00020000002A is not in [RUNNING]: current state is STARTING
2024-12-02 12:13:20,513 [5-server-thread21] INFO o.a.r.s.i.RaftServerImpl:1296
- 5@group-00020000002A: receive setConfiguration
SetConfigurationRequest:client-A8BD159712B0->5@group-00020000002A, cid=1,
seq=null, RW, null, SET_UNCONDITIONALLY, servers:[5|172.16.2.4:10750,
18|172.16.2.18:10750], listeners:[]
2024-12-02 12:13:20,514 [grpc-default-executor-22] WARN
o.a.r.g.s.GrpcAdminProtocolService:79 - Failed to setConfiguration:
SetConfigurationRequest:client-A8BD159712B0->5@group-00020000002A, cid=1,
seq=null, RW, null, SET_UNCONDITIONALLY, servers:[5|172.16.2.4:10750,
18|172.16.2.18:10750], listeners:[]
java.util.concurrent.CompletionException:
org.apache.ratis.protocol.exceptions.ServerNotReadyException:
5@group-00020000002A is not in [RUNNING]: current state is STARTING
Caused by: org.apache.ratis.protocol.exceptions.ServerNotReadyException:
5@group-00020000002A is not in [RUNNING]: current state is STARTING
2024-12-02 12:13:24,553 [5-server-thread21] INFO o.a.r.s.i.RaftServerImpl:1296
- 5@group-00020000002A: receive setConfiguration
SetConfigurationRequest:client-A8BD159712B0->5@group-00020000002A, cid=1,
seq=null, RW, null, SET_UNCONDITIONALLY, servers:[5|172.16.2.4:10750,
18|172.16.2.18:10750], listeners:[]
2024-12-02 12:13:24,553 [grpc-default-executor-22] WARN
o.a.r.g.s.GrpcAdminProtocolService:79 - Failed to setConfiguration:
SetConfigurationRequest:client-A8BD159712B0->5@group-00020000002A, cid=1,
seq=null, RW, null, SET_UNCONDITIONALLY, servers:[5|172.16.2.4:10750,
18|172.16.2.18:10750], listeners:[]
java.util.concurrent.CompletionException:
org.apache.ratis.protocol.exceptions.ServerNotReadyException:
5@group-00020000002A is not in [RUNNING]: current state is STARTING
Caused by: org.apache.ratis.protocol.exceptions.ServerNotReadyException:
5@group-00020000002A is not in [RUNNING]: current state is STARTING
2024-12-02 12:13:26,597 [5-server-thread21] INFO o.a.r.s.i.RaftServerImpl:1638
- 5@group-00020000002A: Failed appendEntries as the first entry (index 0)
already exists (snapshotIndex: -1, commitIndex: 9454)
2024-12-02 12:13:26,597 [5-server-thread21] INFO o.a.r.s.i.RaftServerImpl:1580
- 5@group-00020000002A: appendEntries* reply
18<-5#1:FAIL-t1,INCONSISTENCY,nextIndex=9455,followerCommit=9454,matchIndex=-1
2024-12-02 12:13:38,575 [5-server-thread20] INFO o.a.r.s.i.RaftServerImpl:1296
- 5@group-00020000002A: receive setConfiguration
SetConfigurationRequest:client-A8BD159712B0->5@group-00020000002A, cid=1,
seq=null, RW, null, SET_UNCONDITIONALLY, servers:[5|172.16.2.4:10750,
18|172.16.2.18:10750], listeners:[]
{code}
It seems that the old leader did not have a snapshot and tried to synchronize
the state via appendEntries. However, the initialization was quite slow,
causing the process to exceed the 3 RPC timeout duration without returning any
RPC responses, ultimately causing the migration task to fail. However, the
client seems to keep retrying, and subsequent retries all fail.
There are two things we need to consider here:
- Why was the initial initialization so slow, causing the migration task to
fail? Considering Java's GC, are there parameters we can adjust to control
this? (Currently, IoTDB's default rpcTimeoutMax is set to 4 seconds for the
maximum election timeout). Are there any optimizations on the kernel level that
can be made to remove these magic number?
- If this migration task fails, are there any resources that were not properly
released, preventing the second attempt from succeeding? Or should we delete
and recreate the new peer in new node to continue with the migration?
[~szetszwo] What do you think?
> Fixed issue where Ratis member changes are stuck
> ------------------------------------------------
>
> Key: RATIS-2203
> URL: https://issues.apache.org/jira/browse/RATIS-2203
> Project: Ratis
> Issue Type: Improvement
> Reporter: Xinyu Tan
> Assignee: Xinyu Tan
> Priority: Major
>
> Currently, in IoTDB, we use ratisclient to submit membership change requests,
> and rely on the
> [retrypolicy|https://github.com/apache/iotdb/blob/master/iotdb-core/consensus/src/main/java/org/apache/iotdb/consensus/ratis/RatisClient.java#L236-L249]
> to ensure that membership change requests with longer durations can
> eventually be successfully completed.
> In a recent test, we discovered that Ratis might get stuck during a
> membership change. After a series of troubleshooting steps, we have mostly
> determined that this is a bug in the Ratis core.
> Now, I will list the logs of the leader and the new replica to see if we can
> analyze the cause of the error together.
> leader
> {code:java}
> 2024-11-29 11:50:49,166 [19-server-thread1] INFO
> o.a.r.s.i.RaftServerImpl:1296 - 19@group-000200000061: receive
> setConfiguration
> SetConfigurationRequest:client-5FD616FE3614->19@group-000200000061, cid=1,
> seq=null, RW, null, SET_UNCONDITIONALLY, servers:[4|172.16.2.3:10750,
> 19|172.16.2.18:10750], listeners:[]
> 2024-11-29 11:50:49,168 [19-server-thread1] INFO
> o.a.r.s.i.LeaderStateImpl:485 - 19@group-000200000061-LeaderStateImpl:
> startSetConfiguration
> SetConfigurationRequest:client-5FD616FE3614->19@group-000200000061, cid=1,
> seq=null, RW, null, SET_UNCONDITIONALLY, servers:[4|172.16.2.3:10750,
> 19|172.16.2.18:10750], listeners:[]
> 2024-11-29 11:50:49,208
> [19@group-000200000061->4-GrpcLogAppender-LogAppenderDaemon] INFO
> o.a.r.s.i.FollowerInfoImpl:154 - Follower 19@group-000200000061->4
> acknowledged installing snapshot
> 2024-11-29 11:50:49,215
> [19@group-000200000061->4-GrpcLogAppender-LogAppenderDaemon] INFO
> o.a.r.g.s.GrpcServerProtocolClient:65 - Build channel for 4|172.16.2.3:10750
> 2024-11-29 11:50:59,412 [timer2] WARN o.a.r.g.s.GrpcLogAppender:441 -
> 19@group-000200000061->4-GrpcLogAppender: Timed out appendEntries,
> errorCount=1,
> request=AppendEntriesRequest:cid=1,entriesCount=9198,entries=(t:1,
> i:0)...(t:1, i:9197)
> 2024-11-29 11:50:59,732 [timer3] WARN o.a.r.g.s.GrpcLogAppender:441 -
> 19@group-000200000061->4-GrpcLogAppender: Timed out HEARTBEAT appendEntries,
> errorCount=2, request=AppendEntriesRequest:cid=2,entriesCount=0
> 2024-11-29 11:51:00,238 [timer4] WARN o.a.r.g.s.GrpcLogAppender:441 -
> 19@group-000200000061->4-GrpcLogAppender: Timed out HEARTBEAT appendEntries,
> errorCount=3, request=AppendEntriesRequest:cid=3,entriesCount=0
> 2024-11-29 11:51:00,740 [timer5] WARN o.a.r.g.s.GrpcLogAppender:441 -
> 19@group-000200000061->4-GrpcLogAppender: Timed out HEARTBEAT appendEntries,
> errorCount=4, request=AppendEntriesRequest:cid=4,entriesCount=0
> 2024-11-29 11:51:01,906 [timer6] WARN o.a.r.g.s.GrpcLogAppender:441 -
> 19@group-000200000061->4-GrpcLogAppender: Timed out HEARTBEAT appendEntries,
> errorCount=5, request=AppendEntriesRequest:cid=6,entriesCount=0
> 2024-11-29 11:51:02,406 [timer7] WARN o.a.r.g.s.GrpcLogAppender:441 -
> 19@group-000200000061->4-GrpcLogAppender: Timed out HEARTBEAT appendEntries,
> errorCount=6, request=AppendEntriesRequest:cid=7,entriesCount=0
> 2024-11-29 11:51:02,908 [timer0] WARN o.a.r.g.s.GrpcLogAppender:441 -
> 19@group-000200000061->4-GrpcLogAppender: Timed out HEARTBEAT appendEntries,
> errorCount=7, request=AppendEntriesRequest:cid=8,entriesCount=0
> 2024-11-29 11:51:03,086 [19-server-thread1] INFO
> o.a.r.s.i.RaftServerImpl:1296 - 19@group-000200000061: receive
> setConfiguration
> SetConfigurationRequest:client-5FD616FE3614->19@group-000200000061, cid=1,
> seq=null, RW, null, SET_UNCONDITIONALLY, servers:[4|172.16.2.3:10750,
> 19|172.16.2.18:10750], listeners:[]
> 2024-11-29 11:51:03,410 [timer1] WARN o.a.r.g.s.GrpcLogAppender:441 -
> 19@group-000200000061->4-GrpcLogAppender: Timed out HEARTBEAT appendEntries,
> errorCount=8, request=AppendEntriesRequest:cid=9,entriesCount=0
> 2024-11-29 11:51:03,912 [timer2] WARN o.a.r.g.s.GrpcLogAppender:441 -
> 19@group-000200000061->4-GrpcLogAppender: Timed out HEARTBEAT appendEntries,
> errorCount=9, request=AppendEntriesRequest:cid=10,entriesCount=0
> 2024-11-29 11:51:04,414 [timer3] WARN o.a.r.g.s.GrpcLogAppender:441 -
> 19@group-000200000061->4-GrpcLogAppender: Timed out HEARTBEAT appendEntries,
> errorCount=10, request=AppendEntriesRequest:cid=11,entriesCount=0
> 2024-11-29 11:51:04,914 [timer4] WARN o.a.r.g.s.GrpcLogAppender:441 -
> 19@group-000200000061->4-GrpcLogAppender: Timed out HEARTBEAT appendEntries,
> errorCount=11, request=AppendEntriesRequest:cid=12,entriesCount=0
> 2024-11-29 11:51:05,415 [timer5] WARN o.a.r.g.s.GrpcLogAppender:441 -
> 19@group-000200000061->4-GrpcLogAppender: Timed out HEARTBEAT appendEntries,
> errorCount=12, request=AppendEntriesRequest:cid=13,entriesCount=0
> 2024-11-29 11:51:05,916 [timer6] WARN o.a.r.g.s.GrpcLogAppender:441 -
> 19@group-000200000061->4-GrpcLogAppender: Timed out HEARTBEAT appendEntries,
> errorCount=13, request=AppendEntriesRequest:cid=14,entriesCount=0
> 2024-11-29 11:51:06,418 [timer7] WARN o.a.r.g.s.GrpcLogAppender:441 -
> 19@group-000200000061->4-GrpcLogAppender: Timed out HEARTBEAT appendEntries,
> errorCount=14, request=AppendEntriesRequest:cid=15,entriesCount=0
> 2024-11-29 11:51:06,683
> [19@group-000200000061->4-GrpcLogAppender-LogAppenderDaemon] WARN
> o.a.r.g.s.GrpcLogAppender:293 - 19@group-000200000061->4-GrpcLogAppender:
> Wait interrupted by java.lang.InterruptedException
> 2024-11-29 11:51:06,921 [timer0] WARN o.a.r.g.s.GrpcLogAppender:441 -
> 19@group-000200000061->4-GrpcLogAppender: Timed out HEARTBEAT appendEntries,
> errorCount=15, request=AppendEntriesRequest:cid=16,entriesCount=0
> 2024-11-29 11:51:07,134 [19-server-thread1] INFO
> o.a.r.s.i.RaftServerImpl:1296 - 19@group-000200000061: receive
> setConfiguration
> SetConfigurationRequest:client-5FD616FE3614->19@group-000200000061, cid=1,
> seq=null, RW, null, SET_UNCONDITIONALLY, servers:[4|172.16.2.3:10750,
> 19|172.16.2.18:10750], listeners:[]
> 2024-11-29 11:51:07,135 [19-server-thread1] INFO
> o.a.r.s.i.LeaderStateImpl:485 - 19@group-000200000061-LeaderStateImpl:
> startSetConfiguration
> SetConfigurationRequest:client-5FD616FE3614->19@group-000200000061, cid=1,
> seq=null, RW, null, SET_UNCONDITIONALLY, servers:[4|172.16.2.3:10750,
> 19|172.16.2.18:10750], listeners:[]
> 2024-11-29 11:51:07,137
> [19@group-000200000061->4-GrpcLogAppender-LogAppenderDaemon] INFO
> o.a.r.s.i.FollowerInfoImpl:154 - Follower 19@group-000200000061->4
> acknowledged installing snapshot
> 2024-11-29 11:51:07,421 [timer1] WARN o.a.r.g.s.GrpcLogAppender:441 -
> 19@group-000200000061->4-GrpcLogAppender: Timed out HEARTBEAT appendEntries,
> errorCount=16, request=AppendEntriesRequest:cid=17,entriesCount=0
> 2024-11-29 11:51:07,924 [timer2] WARN o.a.r.g.s.GrpcLogAppender:441 -
> 19@group-000200000061->4-GrpcLogAppender: Timed out HEARTBEAT appendEntries,
> errorCount=17, request=AppendEntriesRequest:cid=18,entriesCount=0
> 2024-11-29 11:51:08,426 [timer3] WARN o.a.r.g.s.GrpcLogAppender:441 -
> 19@group-000200000061->4-GrpcLogAppender: Timed out HEARTBEAT appendEntries,
> errorCount=18, request=AppendEntriesRequest:cid=19,entriesCount=0
> 2024-11-29 11:51:08,950 [grpc-default-executor-5] WARN
> o.a.r.g.s.GrpcLogAppender$AppendLogResponseHandler:519 -
> 19@group-000200000061->4-AppendLogResponseHandler: received INCONSISTENCY
> reply with nextIndex 1, errorCount=1,
> request=AppendEntriesRequest:cid=1,entriesCount=9198,entries=(t:1,
> i:0)...(t:1, i:9197)
> 2024-11-29 11:51:08,951 [grpc-default-executor-5] INFO
> o.a.r.s.i.FollowerInfoImpl:70 - 19@group-000200000061->4: setNextIndex
> nextIndex: updateUnconditionally 9198 -> 1
> 2024-11-29 11:51:08,972 [grpc-default-executor-5] INFO
> o.a.r.g.s.GrpcLogAppender$AppendLogResponseHandler:550 -
> 19@group-000200000061->4-AppendLogResponseHandler: follower responses
> appendEntries COMPLETED
> 2024-11-29 11:51:08,975 [grpc-default-executor-5] INFO
> o.a.r.s.i.FollowerInfoImpl:70 - 19@group-000200000061->4: computeNextIndex
> nextIndex: updateUnconditionally 9198 -> 0
> 2024-11-29 11:51:09,015 [grpc-default-executor-5] WARN
> o.a.r.g.s.GrpcLogAppender$AppendLogResponseHandler:519 -
> 19@group-000200000061->4-AppendLogResponseHandler: received INCONSISTENCY
> reply with nextIndex 9198, errorCount=1,
> request=AppendEntriesRequest:cid=6,entriesCount=9197,entries=(t:1,
> i:1)...(t:1, i:9197)
> 2024-11-29 11:51:09,022 [grpc-default-executor-1] INFO
> o.a.r.g.s.GrpcLogAppender$AppendLogResponseHandler:550 -
> 19@group-000200000061->4-AppendLogResponseHandler: follower responses
> appendEntries COMPLETED
> {code}
> new adding follower
> {code:java}
> 2024-11-29 11:50:46,528 [4-groupManagement] INFO
> o.a.r.s.i.RaftServerImpl:405 - 4@group-000200000061: Successfully started.
> 2024-11-29 11:50:46,915 [4-server-thread1] INFO
> o.a.r.s.i.RaftServerImpl:1296 - 4@group-000200000061: receive
> setConfiguration
> SetConfigurationRequest:client-5FD616FE3614->4@group-000200000061, cid=1,
> seq=null, RW, null, SET_UNCONDITIONALLY, servers:[4|172.16.2.3:10750,
> 19|172.16.2.18:10750], listeners:[]
> 2024-11-29 11:50:50,886 [4-server-thread1] INFO o.a.r.s.i.RaftServerImpl:562
> - 4@group-000200000061: set firstElectionSinceStartup to false for
> appendEntries
> 2024-11-29 11:50:50,886 [4-server-thread1] INFO
> o.a.i.d.c.s.s.SchemaRegionStateMachine:75 - Current node [nodeId: 4] is no
> longer the schema region leader [regionId: SchemaRegion[97]], the new leader
> is [nodeId:19]
> 2024-11-29 11:50:50,886 [4-server-thread1] INFO o.a.r.s.impl.ServerState:272
> - 4@group-000200000061: change Leader from null to 19 at term 1 for
> appendEntries, leader elected after 4575ms
> 2024-11-29 11:50:50,887 [4-server-thread2] INFO o.a.r.s.impl.ServerState:386
> - 4@group-000200000061: set configuration conf: {index: 0,
> cur=peers:[19|172.16.2.18:10750]|listeners:[], old=null}
> 2024-11-29 11:50:50,887 [4-server-thread2] INFO
> o.a.r.s.r.s.SegmentedRaftLogWorker:438 -
> 4@group-000200000061-SegmentedRaftLogWorker: Starting segment from index:0
> 2024-11-29 11:50:50,888 [4@group-000200000061-SegmentedRaftLogWorker] INFO
> o.a.r.s.r.s.BufferedWriteChannel:61 - open log_inprogress_0 at position 0
> 2024-11-29 11:50:56,801 [4@group-000200000061-SegmentedRaftLogWorker] INFO
> o.a.r.s.r.s.SegmentedRaftLogWorker$StartLogSegment:645 -
> 4@group-000200000061-SegmentedRaftLogWorker: created new log segment
> /data1/iotdb/tablev2_rc5_1128_1fc99a6_1rep/data/datanode/consensus/schema_region/47474747-4747-4747-4747-000200000061/current/log_inprogress_0
>
> 2024-11-29 11:51:01,063 [4-server-thread21] INFO
> o.a.r.s.i.RaftServerImpl:1296 - 4@group-000200000061: receive
> setConfiguration
> SetConfigurationRequest:client-5FD616FE3614->4@group-000200000061, cid=1,
> seq=null, RW, null, SET_UNCONDITIONALLY, servers:[4|172.16.2.3:10750,
> 19|172.16.2.18:10750], listeners:[]
> 2024-11-29 11:51:05,118 [4-server-thread21] INFO
> o.a.r.s.i.RaftServerImpl:1296 - 4@group-000200000061: receive
> setConfiguration
> SetConfigurationRequest:client-5FD616FE3614->4@group-000200000061, cid=1,
> seq=null, RW, null, SET_UNCONDITIONALLY, servers:[4|172.16.2.3:10750,
> 19|172.16.2.18:10750], listeners:[]
> 2024-11-29 11:51:08,942 [4-server-thread22] INFO
> o.a.r.s.i.RaftServerImpl:1638 - 4@group-000200000061: Failed appendEntries as
> the first entry (index 0) already exists (snapshotIndex: -1, commitIndex: 0)
> 2024-11-29 11:51:08,943 [4-server-thread22] INFO
> o.a.r.s.i.RaftServerImpl:1580 - 4@group-000200000061: appendEntries* reply
> 19<-4#1:FAIL-t1,INCONSISTENCY,nextIndex=1,followerCommit=-1,matchIndex=-1
> 2024-11-29 11:51:09,003 [4-server-thread2] INFO
> o.a.r.s.i.RaftServerImpl:1638 - 4@group-000200000061: Failed appendEntries as
> the first entry (index 1) already exists (snapshotIndex: -1, commitIndex:
> 9197)
> 2024-11-29 11:51:09,004 [4-server-thread2] INFO
> o.a.r.s.i.RaftServerImpl:1580 - 4@group-000200000061: appendEntries* reply
> 19<-4#6:FAIL-t1,INCONSISTENCY,nextIndex=9198,followerCommit=9197,matchIndex=-1
>
> {code}
> From the logs, we can see that the synchronization link from the leader to
> the new follower has been closed, so the logs cannot be committed. This
> causes the membership change thread to be stuck indefinitely while waiting
> for the follower to catch up.
> Based on the logs above, I have a few suspicions:
> - Why is the gRPC stream API on the leader side being completed? I haven't
> seen any place on the follower side where this function is called.
> - The commitIndex obtained in the checkInconsistentAppendEntries function on
> the follower side seems inconsistent with the one obtained in
> appendEntriesAsync. For example, in the following two lines of logs, the same
> thread gets a commitIndex of -1 outside the lock, and an index of 0 inside
> the lock. This causes the subsequent synchronization to be completely stuck.
> I'm not sure if this is a concurrency issue, but perhaps we should move this
> line inside the
> [lock|ttps://github.com/apache/ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L1591]?
> {code:java}
> 2024-11-29 11:51:08,942 [4-server-thread22] INFO
> o.a.r.s.i.RaftServerImpl:1638 - 4@group-000200000061: Failed appendEntries as
> the first entry (index 0) already exists (snapshotIndex: -1, commitIndex: 0)
> 2024-11-29 11:51:08,943 [4-server-thread22] INFO
> o.a.r.s.i.RaftServerImpl:1580 - 4@group-000200000061: appendEntries* reply
> 19<-4#1:FAIL-t1,INCONSISTENCY,nextIndex=1,followerCommit=-1,matchIndex=-1
> {code}
> What's your opinion? [~szetszwo]
--
This message was sent by Atlassian Jira
(v8.20.10#820010)