[
https://issues.apache.org/jira/browse/RATIS-1305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17279307#comment-17279307
]
Ethan Rose commented on RATIS-1305:
-----------------------------------
RATIS-1241 doesn't fix the bug, sorry if this was not clear. RATIS-1241
identified a problem with snapshot installation and attempted a fix, but the
fix does not work as intended. In RATIS-1241, the new branch leaderStartIndex
== RaftLog.INVALID_LOG_INDEX is hit whenever the leader has no logs, regardless
of whether followers are actually behind. This means that if the leader has no
logs for a period of time and there are no transactions in the system, then
every time it heartbeats it will be telling followers to install snapshots,
which they will reject because they are up to date. This Jira is to fix this
issue so that if the leader has no logs, followers only receive install
snapshot notifications when they are actually behind.
> Leader stuck in infinite install snapshot cycle when logs have been purged
> --------------------------------------------------------------------------
>
> Key: RATIS-1305
> URL: https://issues.apache.org/jira/browse/RATIS-1305
> Project: Ratis
> Issue Type: Bug
> Components: server
> Reporter: Ethan Rose
> Assignee: Ethan Rose
> Priority: Blocker
> Fix For: 1.1.0
>
>
> After logs have been purged from the leader and followers, the leader
> repeatedly attempts to send snapshots to the followers, who reject them
> because there have not been any new transactions to apply. The leader
> continues to send snapshots infinitely, however, and the cluster becomes
> unresponsive.
> Here is an example of the log messages. om1 is the leader, om2 and om3 are
> followers.
> On the leader om1:
> {code}
> om1_1 | 2021-02-02 17:17:23,261
> [om1@group-D66704EFC61C->om2-GrpcLogAppender-LogAppenderDaemon] INFO
> server.GrpcLogAppender: om1@group-D66704EFC61C->om2-GrpcLogAppender:
> followerNextIndex = 337 but logStartIndex = -1, notify follower to install
> snapshot-(t:1, i:337)
> om1_1 | 2021-02-02 17:17:23,272
> [om1@group-D66704EFC61C->om3-GrpcLogAppender-LogAppenderDaemon] INFO
> server.GrpcLogAppender: om1@group-D66704EFC61C->om3-GrpcLogAppender:
> followerNextIndex = 337 but logStartIndex = -1, notify follower to install
> snapshot-(t:1, i:337)
> om1_1 | 2021-02-02 17:17:23,286
> [om1@group-D66704EFC61C->om3-GrpcLogAppender-LogAppenderDaemon] INFO
> server.GrpcLogAppender: om1@group-D66704EFC61C->om3-GrpcLogAppender: send
> om1->om3#0-t1,notify:(t:1, i:337)
> om1_1 | 2021-02-02 17:17:23,286
> [om1@group-D66704EFC61C->om2-GrpcLogAppender-LogAppenderDaemon] INFO
> server.GrpcLogAppender: om1@group-D66704EFC61C->om2-GrpcLogAppender: send
> om1->om2#0-t1,notify:(t:1, i:337)
> om1_1 | 2021-02-02 17:17:23,522 [grpc-default-executor-1] INFO
> server.GrpcLogAppender:
> om1@group-D66704EFC61C->om3-InstallSnapshotResponseHandler: received a reply
> om1<-om3#0:FAIL-t1,ALREADY_INSTALLED,snapshotIndex=336
> om1_1 | 2021-02-02 17:17:23,522 [grpc-default-executor-1] INFO
> server.GrpcLogAppender:
> om1@group-D66704EFC61C->om3-InstallSnapshotResponseHandler: Already Installed
> Snapshot Index 336.
> om1_1 | 2021-02-02 17:17:23,522 [grpc-default-executor-1] INFO
> leader.FollowerInfo: om1@group-D66704EFC61C->om3: snapshotIndex:
> setUnconditionally 0 -> 336
> om1_1 | 2021-02-02 17:17:23,522 [grpc-default-executor-1] INFO
> leader.FollowerInfo: om1@group-D66704EFC61C->om3: matchIndex:
> setUnconditionally 336 -> 336
> om1_1 | 2021-02-02 17:17:23,523 [grpc-default-executor-1] INFO
> leader.FollowerInfo: om1@group-D66704EFC61C->om3: nextIndex:
> setUnconditionally 337 -> 337
> om1_1 | 2021-02-02 17:17:23,523 [grpc-default-executor-1] INFO
> leader.FollowerInfo: om1@group-D66704EFC61C->om3: nextIndex: updateToMax
> old=337, new=337, updated? false
> om1_1 | 2021-02-02 17:17:23,570 [grpc-default-executor-1] INFO
> server.GrpcLogAppender:
> om1@group-D66704EFC61C->om2-InstallSnapshotResponseHandler: received a reply
> om1<-om2#0:FAIL-t1,ALREADY_INSTALLED,snapshotIndex=336
> om1_1 | 2021-02-02 17:17:23,570 [grpc-default-executor-1] INFO
> server.GrpcLogAppender:
> om1@group-D66704EFC61C->om2-InstallSnapshotResponseHandler: Already Installed
> Snapshot Index 336.
> {code}
>
> On follower om2:
> {code}
> om2_1 | 2021-02-02 17:17:23,306 [grpc-default-executor-0] INFO
> server.RaftServer$Division: om2@group-D66704EFC61C: receive installSnapshot:
> om1->om2#0-t1,notify:(t:1, i:337)
> om2_1 | 2021-02-02 17:17:23,312 [grpc-default-executor-0] INFO
> server.RaftServer$Division: om2@group-D66704EFC61C: StateMachine
> snapshotIndex is 336
> om2_1 | 2021-02-02 17:17:23,560 [grpc-default-executor-0] INFO
> server.RaftServer$Division: om2@group-D66704EFC61C: set new configuration
> configurationEntry {
> om2_1 | peers {
> om2_1 | id: "om1"
> om2_1 | address: "om1:9872"
> om2_1 | }
> om2_1 | peers {
> om2_1 | id: "om3"
> om2_1 | address: "om3:9872"
> om2_1 | }
> om2_1 | peers {
> om2_1 | id: "om2"
> om2_1 | address: "om2:9872"
> om2_1 | }
> om2_1 | }
> om2_1 | from snapshot
> om2_1 | 2021-02-02 17:17:23,561 [grpc-default-executor-0] INFO
> server.RaftServer$Division: om2@group-D66704EFC61C: set configuration 0:
> [om1|rpc:om1:9872|dataStream:|priority:0,
> om3|rpc:om3:9872|dataStream:|priority:0,
> om2|rpc:om2:9872|dataStream:|priority:0], old=null
> om2_1 | 2021-02-02 17:17:23,567 [grpc-default-executor-0] INFO
> server.RaftServer$Division: om2@group-D66704EFC61C: reply installSnapshot:
> om1<-om2#0:FAIL-t1,ALREADY_INSTALLED,snapshotIndex=336
> om2_1 | 2021-02-02 17:17:23,570 [grpc-default-executor-0] INFO
> server.GrpcServerProtocolService: om2: Completed INSTALL_SNAPSHOT,
> lastRequest: om1->om2#0-t1,notify:(t:1, i:337)
> {code}
>
> These log messages are repeated forever until the cluster is terminated. The
> term and index numbers do not change.
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)