[ 
https://issues.apache.org/jira/browse/RATIS-1305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz-wo Sze resolved RATIS-1305.
-------------------------------
    Fix Version/s:     (was: 1.1.0)
                   2.0.0
       Resolution: Fixed

I have merged the pull request.  Thanks, [~erose]!

> Leader stuck in infinite install snapshot cycle when logs have been purged
> --------------------------------------------------------------------------
>
>                 Key: RATIS-1305
>                 URL: https://issues.apache.org/jira/browse/RATIS-1305
>             Project: Ratis
>          Issue Type: Bug
>          Components: server
>            Reporter: Ethan Rose
>            Assignee: Ethan Rose
>            Priority: Blocker
>             Fix For: 2.0.0
>
>          Time Spent: 2h 40m
>  Remaining Estimate: 0h
>
> After logs have been purged from the leader and followers, the leader 
> repeatedly attempts to send snapshots to the followers, who reject them 
> because there have not been any new transactions to apply. The leader 
> continues to send snapshots infinitely.
> Here is an example of the log messages. om1 is the leader, om2 and om3 are 
> followers.
> On the leader om1:
> {code:java}
> om1_1 | 2021-02-02 17:17:23,261 
> [om1@group-D66704EFC61C->om2-GrpcLogAppender-LogAppenderDaemon] INFO 
> server.GrpcLogAppender: om1@group-D66704EFC61C->om2-GrpcLogAppender: 
> followerNextIndex = 337 but logStartIndex = -1, notify follower to install 
> snapshot-(t:1, i:337)
> om1_1 | 2021-02-02 17:17:23,272 
> [om1@group-D66704EFC61C->om3-GrpcLogAppender-LogAppenderDaemon] INFO 
> server.GrpcLogAppender: om1@group-D66704EFC61C->om3-GrpcLogAppender: 
> followerNextIndex = 337 but logStartIndex = -1, notify follower to install 
> snapshot-(t:1, i:337)
> om1_1 | 2021-02-02 17:17:23,286 
> [om1@group-D66704EFC61C->om3-GrpcLogAppender-LogAppenderDaemon] INFO 
> server.GrpcLogAppender: om1@group-D66704EFC61C->om3-GrpcLogAppender: send 
> om1->om3#0-t1,notify:(t:1, i:337)
> om1_1 | 2021-02-02 17:17:23,286 
> [om1@group-D66704EFC61C->om2-GrpcLogAppender-LogAppenderDaemon] INFO 
> server.GrpcLogAppender: om1@group-D66704EFC61C->om2-GrpcLogAppender: send 
> om1->om2#0-t1,notify:(t:1, i:337)
> om1_1 | 2021-02-02 17:17:23,522 [grpc-default-executor-1] INFO 
> server.GrpcLogAppender: 
> om1@group-D66704EFC61C->om3-InstallSnapshotResponseHandler: received a reply 
> om1<-om3#0:FAIL-t1,ALREADY_INSTALLED,snapshotIndex=336
> om1_1 | 2021-02-02 17:17:23,522 [grpc-default-executor-1] INFO 
> server.GrpcLogAppender: 
> om1@group-D66704EFC61C->om3-InstallSnapshotResponseHandler: Already Installed 
> Snapshot Index 336.
> om1_1 | 2021-02-02 17:17:23,522 [grpc-default-executor-1] INFO 
> leader.FollowerInfo: om1@group-D66704EFC61C->om3: snapshotIndex: 
> setUnconditionally 0 -> 336
> om1_1 | 2021-02-02 17:17:23,522 [grpc-default-executor-1] INFO 
> leader.FollowerInfo: om1@group-D66704EFC61C->om3: matchIndex: 
> setUnconditionally 336 -> 336
> om1_1 | 2021-02-02 17:17:23,523 [grpc-default-executor-1] INFO 
> leader.FollowerInfo: om1@group-D66704EFC61C->om3: nextIndex: 
> setUnconditionally 337 -> 337
> om1_1 | 2021-02-02 17:17:23,523 [grpc-default-executor-1] INFO 
> leader.FollowerInfo: om1@group-D66704EFC61C->om3: nextIndex: updateToMax 
> old=337, new=337, updated? false
> om1_1 | 2021-02-02 17:17:23,570 [grpc-default-executor-1] INFO 
> server.GrpcLogAppender: 
> om1@group-D66704EFC61C->om2-InstallSnapshotResponseHandler: received a reply 
> om1<-om2#0:FAIL-t1,ALREADY_INSTALLED,snapshotIndex=336
> om1_1 | 2021-02-02 17:17:23,570 [grpc-default-executor-1] INFO 
> server.GrpcLogAppender: 
> om1@group-D66704EFC61C->om2-InstallSnapshotResponseHandler: Already Installed 
> Snapshot Index 336.
> {code}
>  
> On follower om2:
> {code:java}
> om2_1 | 2021-02-02 17:17:23,306 [grpc-default-executor-0] INFO 
> server.RaftServer$Division: om2@group-D66704EFC61C: receive installSnapshot: 
> om1->om2#0-t1,notify:(t:1, i:337)
> om2_1 | 2021-02-02 17:17:23,312 [grpc-default-executor-0] INFO 
> server.RaftServer$Division: om2@group-D66704EFC61C: StateMachine 
> snapshotIndex is 336
> om2_1 | 2021-02-02 17:17:23,560 [grpc-default-executor-0] INFO 
> server.RaftServer$Division: om2@group-D66704EFC61C: set new configuration 
> configurationEntry {
> om2_1 | peers {
> om2_1 | id: "om1"
> om2_1 | address: "om1:9872"
> om2_1 | }
> om2_1 | peers {
> om2_1 | id: "om3"
> om2_1 | address: "om3:9872"
> om2_1 | }
> om2_1 | peers {
> om2_1 | id: "om2"
> om2_1 | address: "om2:9872"
> om2_1 | }
> om2_1 | }
> om2_1 | from snapshot
> om2_1 | 2021-02-02 17:17:23,561 [grpc-default-executor-0] INFO 
> server.RaftServer$Division: om2@group-D66704EFC61C: set configuration 0: 
> [om1|rpc:om1:9872|dataStream:|priority:0, 
> om3|rpc:om3:9872|dataStream:|priority:0, 
> om2|rpc:om2:9872|dataStream:|priority:0], old=null
> om2_1 | 2021-02-02 17:17:23,567 [grpc-default-executor-0] INFO 
> server.RaftServer$Division: om2@group-D66704EFC61C: reply installSnapshot: 
> om1<-om2#0:FAIL-t1,ALREADY_INSTALLED,snapshotIndex=336
> om2_1 | 2021-02-02 17:17:23,570 [grpc-default-executor-0] INFO 
> server.GrpcServerProtocolService: om2: Completed INSTALL_SNAPSHOT, 
> lastRequest: om1->om2#0-t1,notify:(t:1, i:337)
> {code}
>  
> These log messages are repeated forever until the cluster is terminated. The 
> term and index numbers do not change.
>  
> EDIT: I think the cluster not responding was caused by something different. 
> This behavior is still a bug, however.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to