[
https://issues.apache.org/jira/browse/RATIS-1305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tsz-wo Sze resolved RATIS-1305.
-------------------------------
Fix Version/s: (was: 1.1.0)
2.0.0
Resolution: Fixed
I have merged the pull request. Thanks, [~erose]!
> Leader stuck in infinite install snapshot cycle when logs have been purged
> --------------------------------------------------------------------------
>
> Key: RATIS-1305
> URL: https://issues.apache.org/jira/browse/RATIS-1305
> Project: Ratis
> Issue Type: Bug
> Components: server
> Reporter: Ethan Rose
> Assignee: Ethan Rose
> Priority: Blocker
> Fix For: 2.0.0
>
> Time Spent: 2h 40m
> Remaining Estimate: 0h
>
> After logs have been purged from the leader and followers, the leader
> repeatedly attempts to send snapshots to the followers, who reject them
> because there have not been any new transactions to apply. The leader
> continues to send snapshots infinitely.
> Here is an example of the log messages. om1 is the leader, om2 and om3 are
> followers.
> On the leader om1:
> {code:java}
> om1_1 | 2021-02-02 17:17:23,261
> [om1@group-D66704EFC61C->om2-GrpcLogAppender-LogAppenderDaemon] INFO
> server.GrpcLogAppender: om1@group-D66704EFC61C->om2-GrpcLogAppender:
> followerNextIndex = 337 but logStartIndex = -1, notify follower to install
> snapshot-(t:1, i:337)
> om1_1 | 2021-02-02 17:17:23,272
> [om1@group-D66704EFC61C->om3-GrpcLogAppender-LogAppenderDaemon] INFO
> server.GrpcLogAppender: om1@group-D66704EFC61C->om3-GrpcLogAppender:
> followerNextIndex = 337 but logStartIndex = -1, notify follower to install
> snapshot-(t:1, i:337)
> om1_1 | 2021-02-02 17:17:23,286
> [om1@group-D66704EFC61C->om3-GrpcLogAppender-LogAppenderDaemon] INFO
> server.GrpcLogAppender: om1@group-D66704EFC61C->om3-GrpcLogAppender: send
> om1->om3#0-t1,notify:(t:1, i:337)
> om1_1 | 2021-02-02 17:17:23,286
> [om1@group-D66704EFC61C->om2-GrpcLogAppender-LogAppenderDaemon] INFO
> server.GrpcLogAppender: om1@group-D66704EFC61C->om2-GrpcLogAppender: send
> om1->om2#0-t1,notify:(t:1, i:337)
> om1_1 | 2021-02-02 17:17:23,522 [grpc-default-executor-1] INFO
> server.GrpcLogAppender:
> om1@group-D66704EFC61C->om3-InstallSnapshotResponseHandler: received a reply
> om1<-om3#0:FAIL-t1,ALREADY_INSTALLED,snapshotIndex=336
> om1_1 | 2021-02-02 17:17:23,522 [grpc-default-executor-1] INFO
> server.GrpcLogAppender:
> om1@group-D66704EFC61C->om3-InstallSnapshotResponseHandler: Already Installed
> Snapshot Index 336.
> om1_1 | 2021-02-02 17:17:23,522 [grpc-default-executor-1] INFO
> leader.FollowerInfo: om1@group-D66704EFC61C->om3: snapshotIndex:
> setUnconditionally 0 -> 336
> om1_1 | 2021-02-02 17:17:23,522 [grpc-default-executor-1] INFO
> leader.FollowerInfo: om1@group-D66704EFC61C->om3: matchIndex:
> setUnconditionally 336 -> 336
> om1_1 | 2021-02-02 17:17:23,523 [grpc-default-executor-1] INFO
> leader.FollowerInfo: om1@group-D66704EFC61C->om3: nextIndex:
> setUnconditionally 337 -> 337
> om1_1 | 2021-02-02 17:17:23,523 [grpc-default-executor-1] INFO
> leader.FollowerInfo: om1@group-D66704EFC61C->om3: nextIndex: updateToMax
> old=337, new=337, updated? false
> om1_1 | 2021-02-02 17:17:23,570 [grpc-default-executor-1] INFO
> server.GrpcLogAppender:
> om1@group-D66704EFC61C->om2-InstallSnapshotResponseHandler: received a reply
> om1<-om2#0:FAIL-t1,ALREADY_INSTALLED,snapshotIndex=336
> om1_1 | 2021-02-02 17:17:23,570 [grpc-default-executor-1] INFO
> server.GrpcLogAppender:
> om1@group-D66704EFC61C->om2-InstallSnapshotResponseHandler: Already Installed
> Snapshot Index 336.
> {code}
>
> On follower om2:
> {code:java}
> om2_1 | 2021-02-02 17:17:23,306 [grpc-default-executor-0] INFO
> server.RaftServer$Division: om2@group-D66704EFC61C: receive installSnapshot:
> om1->om2#0-t1,notify:(t:1, i:337)
> om2_1 | 2021-02-02 17:17:23,312 [grpc-default-executor-0] INFO
> server.RaftServer$Division: om2@group-D66704EFC61C: StateMachine
> snapshotIndex is 336
> om2_1 | 2021-02-02 17:17:23,560 [grpc-default-executor-0] INFO
> server.RaftServer$Division: om2@group-D66704EFC61C: set new configuration
> configurationEntry {
> om2_1 | peers {
> om2_1 | id: "om1"
> om2_1 | address: "om1:9872"
> om2_1 | }
> om2_1 | peers {
> om2_1 | id: "om3"
> om2_1 | address: "om3:9872"
> om2_1 | }
> om2_1 | peers {
> om2_1 | id: "om2"
> om2_1 | address: "om2:9872"
> om2_1 | }
> om2_1 | }
> om2_1 | from snapshot
> om2_1 | 2021-02-02 17:17:23,561 [grpc-default-executor-0] INFO
> server.RaftServer$Division: om2@group-D66704EFC61C: set configuration 0:
> [om1|rpc:om1:9872|dataStream:|priority:0,
> om3|rpc:om3:9872|dataStream:|priority:0,
> om2|rpc:om2:9872|dataStream:|priority:0], old=null
> om2_1 | 2021-02-02 17:17:23,567 [grpc-default-executor-0] INFO
> server.RaftServer$Division: om2@group-D66704EFC61C: reply installSnapshot:
> om1<-om2#0:FAIL-t1,ALREADY_INSTALLED,snapshotIndex=336
> om2_1 | 2021-02-02 17:17:23,570 [grpc-default-executor-0] INFO
> server.GrpcServerProtocolService: om2: Completed INSTALL_SNAPSHOT,
> lastRequest: om1->om2#0-t1,notify:(t:1, i:337)
> {code}
>
> These log messages are repeated forever until the cluster is terminated. The
> term and index numbers do not change.
>
> EDIT: I think the cluster not responding was caused by something different.
> This behavior is still a bug, however.
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)