Ethan Rose created RATIS-1305:
---------------------------------

             Summary: Leader stuck in infinite install snapshot cycle when logs 
have been purged
                 Key: RATIS-1305
                 URL: https://issues.apache.org/jira/browse/RATIS-1305
             Project: Ratis
          Issue Type: Bug
          Components: server
            Reporter: Ethan Rose
            Assignee: Ethan Rose
             Fix For: 1.1.0


After logs have been purged from the leader and the followers, the leader 
repeatedly notifies the followers to install a snapshot. The followers reject 
the request with ALREADY_INSTALLED because there are no new transactions to 
apply. The leader nevertheless keeps sending the notification indefinitely, and 
the cluster becomes unresponsive.

Below are example log messages. om1 is the leader; om2 and om3 are followers.

On the leader om1:

{code}

om1_1 | 2021-02-02 17:17:23,261 [om1@group-D66704EFC61C->om2-GrpcLogAppender-LogAppenderDaemon] INFO server.GrpcLogAppender: om1@group-D66704EFC61C->om2-GrpcLogAppender: followerNextIndex = 337 but logStartIndex = -1, notify follower to install snapshot-(t:1, i:337)
om1_1 | 2021-02-02 17:17:23,272 [om1@group-D66704EFC61C->om3-GrpcLogAppender-LogAppenderDaemon] INFO server.GrpcLogAppender: om1@group-D66704EFC61C->om3-GrpcLogAppender: followerNextIndex = 337 but logStartIndex = -1, notify follower to install snapshot-(t:1, i:337)
om1_1 | 2021-02-02 17:17:23,286 [om1@group-D66704EFC61C->om3-GrpcLogAppender-LogAppenderDaemon] INFO server.GrpcLogAppender: om1@group-D66704EFC61C->om3-GrpcLogAppender: send om1->om3#0-t1,notify:(t:1, i:337)
om1_1 | 2021-02-02 17:17:23,286 [om1@group-D66704EFC61C->om2-GrpcLogAppender-LogAppenderDaemon] INFO server.GrpcLogAppender: om1@group-D66704EFC61C->om2-GrpcLogAppender: send om1->om2#0-t1,notify:(t:1, i:337)
om1_1 | 2021-02-02 17:17:23,522 [grpc-default-executor-1] INFO server.GrpcLogAppender: om1@group-D66704EFC61C->om3-InstallSnapshotResponseHandler: received a reply om1<-om3#0:FAIL-t1,ALREADY_INSTALLED,snapshotIndex=336
om1_1 | 2021-02-02 17:17:23,522 [grpc-default-executor-1] INFO server.GrpcLogAppender: om1@group-D66704EFC61C->om3-InstallSnapshotResponseHandler: Already Installed Snapshot Index 336.
om1_1 | 2021-02-02 17:17:23,522 [grpc-default-executor-1] INFO leader.FollowerInfo: om1@group-D66704EFC61C->om3: snapshotIndex: setUnconditionally 0 -> 336
om1_1 | 2021-02-02 17:17:23,522 [grpc-default-executor-1] INFO leader.FollowerInfo: om1@group-D66704EFC61C->om3: matchIndex: setUnconditionally 336 -> 336
om1_1 | 2021-02-02 17:17:23,523 [grpc-default-executor-1] INFO leader.FollowerInfo: om1@group-D66704EFC61C->om3: nextIndex: setUnconditionally 337 -> 337
om1_1 | 2021-02-02 17:17:23,523 [grpc-default-executor-1] INFO leader.FollowerInfo: om1@group-D66704EFC61C->om3: nextIndex: updateToMax old=337, new=337, updated? false
om1_1 | 2021-02-02 17:17:23,570 [grpc-default-executor-1] INFO server.GrpcLogAppender: om1@group-D66704EFC61C->om2-InstallSnapshotResponseHandler: received a reply om1<-om2#0:FAIL-t1,ALREADY_INSTALLED,snapshotIndex=336
om1_1 | 2021-02-02 17:17:23,570 [grpc-default-executor-1] INFO server.GrpcLogAppender: om1@group-D66704EFC61C->om2-InstallSnapshotResponseHandler: Already Installed Snapshot Index 336.

{code}

 

On follower om2:

{code}

om2_1 | 2021-02-02 17:17:23,306 [grpc-default-executor-0] INFO server.RaftServer$Division: om2@group-D66704EFC61C: receive installSnapshot: om1->om2#0-t1,notify:(t:1, i:337)
om2_1 | 2021-02-02 17:17:23,312 [grpc-default-executor-0] INFO server.RaftServer$Division: om2@group-D66704EFC61C: StateMachine snapshotIndex is 336
om2_1 | 2021-02-02 17:17:23,560 [grpc-default-executor-0] INFO server.RaftServer$Division: om2@group-D66704EFC61C: set new configuration configurationEntry {
om2_1 | peers {
om2_1 | id: "om1"
om2_1 | address: "om1:9872"
om2_1 | }
om2_1 | peers {
om2_1 | id: "om3"
om2_1 | address: "om3:9872"
om2_1 | }
om2_1 | peers {
om2_1 | id: "om2"
om2_1 | address: "om2:9872"
om2_1 | }
om2_1 | }
om2_1 | from snapshot
om2_1 | 2021-02-02 17:17:23,561 [grpc-default-executor-0] INFO server.RaftServer$Division: om2@group-D66704EFC61C: set configuration 0: [om1|rpc:om1:9872|dataStream:|priority:0, om3|rpc:om3:9872|dataStream:|priority:0, om2|rpc:om2:9872|dataStream:|priority:0], old=null
om2_1 | 2021-02-02 17:17:23,567 [grpc-default-executor-0] INFO server.RaftServer$Division: om2@group-D66704EFC61C: reply installSnapshot: om1<-om2#0:FAIL-t1,ALREADY_INSTALLED,snapshotIndex=336
om2_1 | 2021-02-02 17:17:23,570 [grpc-default-executor-0] INFO server.GrpcServerProtocolService: om2: Completed INSTALL_SNAPSHOT, lastRequest: om1->om2#0-t1,notify:(t:1, i:337)

{code}

 

These log messages repeat until the cluster is terminated. The term and index 
numbers never change.
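
The cycle visible in the logs above can be modeled with a small Java sketch. 
This is not Ratis code: the class SnapshotLoopSketch, its method names, and the 
exact notify condition are assumptions chosen only to mirror the log messages 
(logStartIndex = -1 after the purge, follower snapshotIndex = 336, nextIndex 
reset to 337 on every ALREADY_INSTALLED reply). It illustrates why the leader's 
state never changes between rounds.

```java
public class SnapshotLoopSketch {

    // Assumed condition, modeled on the leader log line
    // "followerNextIndex = 337 but logStartIndex = -1, notify follower to install snapshot".
    // -1 stands for "no log entries available" (everything was purged).
    public static boolean shouldNotifyInstallSnapshot(long followerNextIndex, long logStartIndex) {
        return logStartIndex == -1 || followerNextIndex < logStartIndex;
    }

    // Counts how many install-snapshot notifications the modeled leader sends
    // within maxRounds. The modeled follower always replies ALREADY_INSTALLED
    // with the same snapshot index, as in the follower log.
    public static int countNotifications(long followerNextIndex, long logStartIndex,
                                         long followerSnapshotIndex, int maxRounds) {
        int sent = 0;
        long nextIndex = followerNextIndex;
        for (int round = 0; round < maxRounds; round++) {
            if (!shouldNotifyInstallSnapshot(nextIndex, logStartIndex)) {
                break; // normal log replication could take over here
            }
            sent++;
            // On ALREADY_INSTALLED the leader log shows nextIndex set to
            // snapshotIndex + 1 (337 -> 337), so nothing advances.
            nextIndex = followerSnapshotIndex + 1;
        }
        return sent;
    }

    public static void main(String[] args) {
        // Values from the logs: the loop only stops because of maxRounds.
        System.out.println(countNotifications(337, -1, 336, 5));  // prints 5
        // Hypothetical contrast: a log that still starts at index 300.
        System.out.println(countNotifications(337, 300, 336, 5)); // prints 0
    }
}
```

With the values from the logs, every round ends in the same state it started 
in, so the only bound on the loop in this model is the artificial maxRounds 
parameter.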

 



