[
https://issues.apache.org/jira/browse/RATIS-2137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17879663#comment-17879663
]
Tsz-wo Sze commented on RATIS-2137:
-----------------------------------
[~lemony], are you going to build it yourself? Or do you need a release based
on 2.5.1? Please feel free to share your thought.
> Leader fails to send correct index to follower after timeout exception
> ----------------------------------------------------------------------
>
> Key: RATIS-2137
> URL: https://issues.apache.org/jira/browse/RATIS-2137
> Project: Ratis
> Issue Type: Bug
> Components: server
> Affects Versions: 2.5.1
> Reporter: Kevin Liu
> Assignee: Kevin Liu
> Priority: Major
> Fix For: 3.2.0
>
> Attachments: image-2024-08-13-11-28-16-250.png
>
> Time Spent: 0.5h
> Remaining Estimate: 0h
>
> *I found that after the following log, the follower became unavailable. The
> follower received incorrect entries repeatedly for about 10min, then got
> installSnapshot failed and started to election. After two hours, it succeed
> to install snapshot, but failed to updateLastAppliedTermIndex. After that, it
> repeated 'receive installSnapshot and installSnapshot failed' for several
> hours until I restarted the server.*
> 24/08/11 09:03:13,714 INFO [nioEventLoopGroup-3-3] RaftServer$Division:
> 1@group-47BEDE733167: Failed appendEntries as the first entry (index
> 34795876) already exists (snapshotIndex: 34670809, commitIndex: 34795893)
> 24/08/11 09:03:13,714 INFO [nioEventLoopGroup-3-3] RaftServer$Division:
> 1@group-47BEDE733167: inconsistency entries.
> Reply:3<-1#2559343:FAIL-t59,INCONSISTENCY,nextIndex=34795894,followerCommit=34795893,matchIndex=-1
> 24/08/11 09:03:13,715 INFO [nioEventLoopGroup-3-3] RaftServer$Division:
> 1@group-47BEDE733167: Failed appendEntries as the first entry (index
> 34795875) already exists (snapshotIndex: 34670809, commitIndex: 34795893)
> 24/08/11 09:03:13,715 INFO [nioEventLoopGroup-3-3] RaftServer$Division:
> 1@group-47BEDE733167: inconsistency entries.
> Reply:3<-1#2559406:FAIL-t59,INCONSISTENCY,nextIndex=34795894,followerCommit=34795893,matchIndex=-1
> *....(repeat 'Failed appendEntries')*
> 24/08/11 09:15:41,827 INFO [nioEventLoopGroup-3-3] RaftServer$Division:
> 1@group-47BEDE733167: Failed appendEntries as the first entry (index
> 34465382) already exists (snapshotIndex: 34670809, commitIndex: 34795893)
> 24/08/11 09:15:41,827 INFO [nioEventLoopGroup-3-3] RaftServer$Division:
> 1@group-47BEDE733167: inconsistency entries.
> Reply:3<-1#2892557:FAIL-t59,INCONSISTENCY,nextIndex=34795894,followerCommit=34795893,matchIndex=-1
> 24/08/11 09:15:42,230 INFO [nioEventLoopGroup-3-3]
> SnapshotInstallationHandler: 1@group-47BEDE733167: receive installSnapshot:
> 3->1#0-t59,chunk:bbe49073-5dad-4499-9051-58a0e53b0658,0
> 24/08/11 09:15:42,231 ERROR [nioEventLoopGroup-3-3]
> SnapshotInstallationHandler: 1@group-47BEDE733167: installSnapshot failed
> java.lang.IllegalStateException: 1@group-47BEDE733167 log's commit index is
> 34795893, last included index in snapshot is 34670057
> 24/08/11 09:15:42,233 WARN [nioEventLoopGroup-3-3] DefaultChannelPipeline: An
> exceptionCaught() event was fired, and it reached at the tail of the
> pipeline. It usually means the last handler in the pipeline did not handle
> the exception.
> java.lang.IllegalStateException: 1@group-47BEDE733167 log's commit index is
> 34795893, last included index in snapshot is 34670057
> 24/08/11 09:15:50,105 INFO [1@group-47BEDE733167-FollowerState]
> FollowerState: 1@group-47BEDE733167-FollowerState: change to CANDIDATE,
> lastRpcElapsedTime:7874610911ns, electionTimeout:3353ms
> 24/08/11 09:15:50,105 INFO [1@group-47BEDE733167-FollowerState] RoleInfo: 1:
> shutdown 1@group-47BEDE733167-FollowerState
> 24/08/11 09:15:50,105 INFO [1@group-47BEDE733167-FollowerState]
> RaftServer$Division: 1@group-47BEDE733167: changes role from FOLLOWER to
> CANDIDATE at term 59 for changeToCandidate
> 24/08/11 09:15:50,106 INFO [1@group-47BEDE733167-FollowerState]
> RaftServerConfigKeys: raft.server.leaderelection.pre-vote = true (default)
> 24/08/11 09:15:50,106 INFO [1@group-47BEDE733167-FollowerState] RoleInfo: 1:
> start 1@group-47BEDE733167-LeaderElection5
> 24/08/11 09:15:50,107 INFO [1@group-47BEDE733167-LeaderElection5]
> RaftServer$Division: 1@group-47BEDE733167: change Leader from 3 to null at
> term 59 for PRE_VOTE
> 24/08/11 09:15:50,107 INFO [1@group-47BEDE733167-LeaderElection5]
> LeaderElection: 1@group-47BEDE733167-LeaderElection5 PRE_VOTE round 0: submit
> vote requests at term 59 for 34233595:
> peers:[1|rpc:xxx:9862|admin:|client:xxx:9087|dataStream:|priority:0|startupRole:FOLLOWER,
>
> 2|rpc:xxx:9862|admin:|client:xxx:9087|dataStream:|priority:0|startupRole:FOLLOWER,
>
> 3|rpc:xxx:9862|admin:|client:xxx:9087|dataStream:|priority:0|startupRole:FOLLOWER]|listeners:[],
> old=null
> 24/08/11 09:15:50,110 INFO [1@group-47BEDE733167-LeaderElection5]
> LeaderElection: 1@group-47BEDE733167-LeaderElection5: PRE_VOTE REJECTED
> received 2 response(s) and 0 exception(s):
> 24/08/11 09:15:50,110 INFO [1@group-47BEDE733167-LeaderElection5]
> LeaderElection: Response 0: 1<-2#0:FAIL-t59
> 24/08/11 09:15:50,110 INFO [1@group-47BEDE733167-LeaderElection5]
> LeaderElection: Response 1: 1<-3#0:FAIL-t59
> 24/08/11 09:15:50,110 INFO [1@group-47BEDE733167-LeaderElection5]
> LeaderElection: 1@group-47BEDE733167-LeaderElection5 PRE_VOTE round 0: result
> REJECTED
> 24/08/11 09:15:50,110 INFO [1@group-47BEDE733167-LeaderElection5]
> RaftServer$Division: 1@group-47BEDE733167: changes role from CANDIDATE to
> FOLLOWER at term 59 for REJECTED
> 24/08/11 09:15:50,110 INFO [1@group-47BEDE733167-LeaderElection5] RoleInfo:
> 1: shutdown 1@group-47BEDE733167-LeaderElection5
> 24/08/11 09:15:50,111 INFO [1@group-47BEDE733167-LeaderElection5] RoleInfo:
> 1: start 1@group-47BEDE733167-FollowerState
> 24/08/11 09:15:52,075 INFO [nioEventLoopGroup-3-3]
> SnapshotInstallationHandler: 1@group-47BEDE733167: receive installSnapshot:
> 3->1#0-t59,chunk:802b8106-076b-4ffc-8cf1-1ace04952c7a,0
> 24/08/11 09:15:52,075 INFO [nioEventLoopGroup-3-3] RaftServer$Division:
> 1@group-47BEDE733167: change Leader from null to 3 at term 59 for
> installSnapshot, leader elected after 1967ms
> 24/08/11 09:15:52,075 ERROR [nioEventLoopGroup-3-3]
> SnapshotInstallationHandler: 1@group-47BEDE733167: installSnapshot failed
> java.lang.IllegalStateException: 1@group-47BEDE733167 log's commit index is
> 34795893, last included index in snapshot is 34670057
> *(repeat 'election, fail, install snapshot fail' for more than 2 hours)*
> *(still election fail, but this time install snapshot successfully, but
> failed to updateLastAppliedTermIndex)*
> 24/08/11 11:52:38,168 INFO [nioEventLoopGroup-3-3]
> SnapshotInstallationHandler: 1@group-47BEDE733167: receive installSnapshot:
> 3->1#0-t59,chunk:dcc73f7c-c17d-44e5-96b6-f6f1570e68a7,0
> 24/08/11 11:52:38,168 INFO [nioEventLoopGroup-3-3] RaftServer$Division:
> 1@group-47BEDE733167: change Leader from null to 3 at term 59 for
> installSnapshot, leader elected after 3610ms
> 24/08/11 11:52:39,434 INFO [nioEventLoopGroup-3-3] SnapshotManager:
> Installing snapshot:3->1#0-t59,chunk:dcc73f7c-c17d-44e5-96b6-f6f1570e68a7,0,
> to tmp
> dir:/data0/celeborn/ha_ratis/c5196f6d-2c34-3ed3-8b8a-47bede733167/tmp/snapshot-dcc73f7c-c17d-44e5-96b6-f6f1570e68a7
> 24/08/11 11:52:39,455 INFO [nioEventLoopGroup-3-3] SnapshotManager: Installed
> snapshot, renaming temporary dir
> /data0/celeborn/ha_ratis/c5196f6d-2c34-3ed3-8b8a-47bede733167/tmp/snapshot-dcc73f7c-c17d-44e5-96b6-f6f1570e68a7
> to /data0/celeborn/ha_ratis/c5196f6d-2c34-3ed3-8b8a-47bede733167/sm
> 24/08/11 11:52:40,829 INFO [1@group-47BEDE733167-StateMachineUpdater]
> StateMachine: Reinitializing state machine.
> 24/08/11 11:52:40,829 INFO [nioEventLoopGroup-3-3] RaftLog:
> 1@group-47BEDE733167-SegmentedRaftLog: snapshotIndex: updateIncreasingly
> 33870802 -> 34870063
> 24/08/11 11:52:40,829 INFO [1@group-47BEDE733167-StateMachineUpdater]
> StateMachine: obsolete snapshot provided: (t:59, i:34670809)
> 24/08/11 11:52:40,829 INFO [nioEventLoopGroup-3-3] SegmentedRaftLogWorker:
> 1@group-47BEDE733167-SegmentedRaftLogWorker: flushIndex: setUnconditionally
> 34795893 -> 34870063
> 24/08/11 11:52:40,829 INFO [nioEventLoopGroup-3-3] SegmentedRaftLogWorker:
> 1@group-47BEDE733167-SegmentedRaftLogWorker: safeCacheEvictIndex:
> setUnconditionally 34783743 -> 34870063
> 24/08/11 11:52:40,829 INFO [1@group-47BEDE733167-StateMachineUpdater]
> StateMachineUpdater: 1@group-47BEDE733167-StateMachineUpdater: appliedIndex:
> setUnconditionally 34795893 -> 34670809
> 24/08/11 11:52:40,829 INFO [nioEventLoopGroup-3-3] SegmentedRaftLogWorker:
> 1@group-47BEDE733167-SegmentedRaftLogWorker: Closing segment
> log_inprogress_34783744 to index: 34795893
> 24/08/11 11:52:40,837 INFO [nioEventLoopGroup-3-3]
> SnapshotInstallationHandler: 1@group-47BEDE733167: successfully install the
> entire snapshot-34870063
> 24/08/11 11:52:40,877 INFO [nioEventLoopGroup-3-3]
> SnapshotInstallationHandler: 1@group-47BEDE733167: reply installSnapshot:
> 3<-1#0:OK-t59,SUCCESS,requestIndex=0
> 24/08/11 11:52:40,902 INFO [1@group-47BEDE733167-SegmentedRaftLogWorker]
> SegmentedRaftLogWorker: 1@group-47BEDE733167-SegmentedRaftLogWorker: Rolled
> log segment from
> /data0/celeborn/ha_ratis/c5196f6d-2c34-3ed3-8b8a-47bede733167/current/log_inprogress_34783744
> to
> /data0/celeborn/ha_ratis/c5196f6d-2c34-3ed3-8b8a-47bede733167/current/log_34783744-34795893
> 24/08/11 11:52:41,160 ERROR [1@group-47BEDE733167-StateMachineUpdater]
> StateMachineUpdater: 1@group-47BEDE733167-StateMachineUpdater caught a
> Throwable.
> java.lang.IllegalStateException: 1: Failed updateLastAppliedTermIndex: newTI
> = (t:59, i:34670810) < oldTI = (t:59, i:34795893)
> 24/08/11 11:52:41,161 INFO [1@group-47BEDE733167-StateMachineUpdater]
> RaftServer$Division: 1@group-47BEDE733167: shutdown
> 24/08/11 11:52:41,161 INFO [1@group-47BEDE733167-StateMachineUpdater]
> JmxRegister: Successfully un-registered JMX Bean with object name
> Ratis:service=RaftServer,group=group-47BEDE733167,id=1
> 24/08/11 11:52:41,161 INFO [1@group-47BEDE733167-StateMachineUpdater]
> RoleInfo: 1: shutdown 1@group-47BEDE733167-FollowerState
> 24/08/11 11:52:41,161 INFO [1@group-47BEDE733167-FollowerState]
> FollowerState: 1@group-47BEDE733167-FollowerState was interrupted
> 24/08/11 11:52:41,161 INFO [1@group-47BEDE733167-StateMachineUpdater]
> RaftServer$Division: 1@group-47BEDE733167: closes. applyIndex: 34670810
> 24/08/11 11:52:42,340 INFO [1@group-47BEDE733167-StateMachineUpdater]
> SegmentedRaftLogWorker: 1@group-47BEDE733167-SegmentedRaftLogWorker close()
> 24/08/11 11:52:48,170 INFO [nioEventLoopGroup-3-3]
> SnapshotInstallationHandler: 1@group-47BEDE733167: receive installSnapshot:
> 3->1#0-t59,chunk:7179265b-d007-4835-9b5f-fa9e4d853b72,0
> 24/08/11 11:52:48,170 ERROR [nioEventLoopGroup-3-3]
> SnapshotInstallationHandler: 1@group-47BEDE733167: installSnapshot failed
> org.apache.ratis.protocol.exceptions.ServerNotReadyException:
> 1@group-47BEDE733167 is not in [STARTING, RUNNING]: current state is CLOSED
> 24/08/11 11:52:58,171 INFO [nioEventLoopGroup-3-3]
> SnapshotInstallationHandler: 1@group-47BEDE733167: receive installSnapshot:
> 3->1#0-t59,chunk:7f56f35f-20f3-4b42-bdb6-c2d52db646fc,0
> 24/08/11 11:52:58,171 ERROR [nioEventLoopGroup-3-3]
> SnapshotInstallationHandler: 1@group-47BEDE733167: installSnapshot failed
> org.apache.ratis.protocol.exceptions.ServerNotReadyException:
> 1@group-47BEDE733167 is not in [STARTING, RUNNING]: current state is CLOSED
> *(repeat 'receive installSnapshot and installSnapshot failed' for serval
> hours until I restarted the server)*
> *Here is what I found in the leader's log*
> 24/08/11 09:03:10,130 WARN
> [3@group-47BEDE733167->1-LogAppenderDefault-LogAppenderDaemon] LogAppender:
> 3@group-47BEDE733167->1-LogAppenderDefault: Failed to appendEntries
> (retry=1): org.apache.ratis.protocol.exceptions.TimeoutIOException
> 24/08/11 09:03:13,714 INFO
> [3@group-47BEDE733167->1-LogAppenderDefault-LogAppenderDaemon] FollowerInfo:
> 3@group-47BEDE733167->1: decreaseNextIndex nextIndex: updateUnconditionally
> 34795876 -> 34795875
> 24/08/11 09:03:13,715 INFO
> [3@group-47BEDE733167->1-LogAppenderDefault-LogAppenderDaemon] FollowerInfo:
> 3@group-47BEDE733167->1: decreaseNextIndex nextIndex: updateUnconditionally
> 34795875 -> 34795874
> *I guess when the leader called appendEntries for the first time, the
> follower had been executed successfully, but the leader did not receive the
> follower's response and times out. When it resent, it found that it could not
> match the change and started to decreaseNextIndex.*
> *Sometimes it can be fixed automatically by rolling segment log, but not
> always.*
> 24/08/12 11:47:53,351 INFO [nioEventLoopGroup-3-3] RaftServer$Division:
> 2@group-47BEDE733167: Failed appendEntries as the first entry (index
> 35049128) already exists (snapshotIndex: 35070063, commitIndex: 35259248)
> 24/08/12 11:47:53,351 INFO [nioEventLoopGroup-3-3] RaftServer$Division:
> 2@group-47BEDE733167: inconsistency entries.
> Reply:3<-2#3583083:FAIL-t59,INCONSISTENCY,nextIndex=35259249,followerCommit=35259248,matchIndex=-1
> 24/08/12 11:47:55,132 INFO [nioEventLoopGroup-3-3] SegmentedRaftLogWorker:
> 2@group-47BEDE733167-SegmentedRaftLogWorker: Rolling segment
> log-35250411_35261308 to index:35261308
> 24/08/12 11:54:46,402 INFO [2@group-47BEDE733167-StateMachineUpdater]
> StateMachine: Taking a snapshot to file
> /data0/celeborn/ha_ratis/c5196f6d-2c34-3ed3-8b8a-47bede733167/tmp/raft_snapshot_1723434886330_4407392949522239377.dat.
> 24/08/12 11:54:46,403 INFO [2@group-47BEDE733167-StateMachineUpdater]
> StateMachine: Saving digest 272e2c87a8e62785588ab383c0b8c290 for snapshot
> file
> /data0/celeborn/ha_ratis/c5196f6d-2c34-3ed3-8b8a-47bede733167/sm/snapshot.59_35270066.
> 24/08/12 11:54:46,458 INFO [2@group-47BEDE733167-StateMachineUpdater]
> StateMachine: Renaming a snapshot file
> /data0/celeborn/ha_ratis/c5196f6d-2c34-3ed3-8b8a-47bede733167/tmp/raft_snapshot_1723434886330_4407392949522239377.dat
> to
> /data0/celeborn/ha_ratis/c5196f6d-2c34-3ed3-8b8a-47bede733167/sm/snapshot.59_35270066.
> 24/08/12 11:54:46,487 INFO [2@group-47BEDE733167-StateMachineUpdater]
> SimpleStateMachineStorage: Deleting old snapshot at
> /data0/celeborn/ha_ratis/c5196f6d-2c34-3ed3-8b8a-47bede733167/sm/snapshot.59_34670063
> 24/08/12 11:54:46,530 INFO [2@group-47BEDE733167-StateMachineUpdater]
> StateMachineUpdater: 2@group-47BEDE733167-StateMachineUpdater: Took a
> snapshot at index 35270066
> 24/08/12 11:54:46,530 INFO [2@group-47BEDE733167-StateMachineUpdater]
> StateMachineUpdater: 2@group-47BEDE733167-StateMachineUpdater: snapshotIndex:
> updateIncreasingly 35070063 -> 35270066
> 24/08/12 11:57:16,592 INFO [nioEventLoopGroup-3-3] SegmentedRaftLogWorker:
> 2@group-47BEDE733167-SegmentedRaftLogWorker: Rolling segment
> log-35261309_35271592 to index:35271592
--
This message was sent by Atlassian Jira
(v8.20.10#820010)