Ethan Rose created RATIS-1305:
---------------------------------
Summary: Leader stuck in infinite install snapshot cycle when logs
have been purged
Key: RATIS-1305
URL: https://issues.apache.org/jira/browse/RATIS-1305
Project: Ratis
Issue Type: Bug
Components: server
Reporter: Ethan Rose
Assignee: Ethan Rose
Fix For: 1.1.0
After logs have been purged from the leader and followers, the leader
repeatedly attempts to send snapshots to the followers, who reject them because
there have not been any new transactions to apply. The leader continues to send
snapshots infinitely, however, and the cluster becomes unresponsive.
Here is an example of the log messages. om1 is the leader, om2 and om3 are
followers.
On the leader om1:
{code}
om1_1 | 2021-02-02 17:17:23,261
[om1@group-D66704EFC61C->om2-GrpcLogAppender-LogAppenderDaemon] INFO
server.GrpcLogAppender: om1@group-D66704EFC61C->om2-GrpcLogAppender:
followerNextIndex = 337 but logStartIndex = -1, notify follower to install
snapshot-(t:1, i:337)
om1_1 | 2021-02-02 17:17:23,272
[om1@group-D66704EFC61C->om3-GrpcLogAppender-LogAppenderDaemon] INFO
server.GrpcLogAppender: om1@group-D66704EFC61C->om3-GrpcLogAppender:
followerNextIndex = 337 but logStartIndex = -1, notify follower to install
snapshot-(t:1, i:337)
om1_1 | 2021-02-02 17:17:23,286
[om1@group-D66704EFC61C->om3-GrpcLogAppender-LogAppenderDaemon] INFO
server.GrpcLogAppender: om1@group-D66704EFC61C->om3-GrpcLogAppender: send
om1->om3#0-t1,notify:(t:1, i:337)
om1_1 | 2021-02-02 17:17:23,286
[om1@group-D66704EFC61C->om2-GrpcLogAppender-LogAppenderDaemon] INFO
server.GrpcLogAppender: om1@group-D66704EFC61C->om2-GrpcLogAppender: send
om1->om2#0-t1,notify:(t:1, i:337)
om1_1 | 2021-02-02 17:17:23,522 [grpc-default-executor-1] INFO
server.GrpcLogAppender:
om1@group-D66704EFC61C->om3-InstallSnapshotResponseHandler: received a reply
om1<-om3#0:FAIL-t1,ALREADY_INSTALLED,snapshotIndex=336
om1_1 | 2021-02-02 17:17:23,522 [grpc-default-executor-1] INFO
server.GrpcLogAppender:
om1@group-D66704EFC61C->om3-InstallSnapshotResponseHandler: Already Installed
Snapshot Index 336.
om1_1 | 2021-02-02 17:17:23,522 [grpc-default-executor-1] INFO
leader.FollowerInfo: om1@group-D66704EFC61C->om3: snapshotIndex:
setUnconditionally 0 -> 336
om1_1 | 2021-02-02 17:17:23,522 [grpc-default-executor-1] INFO
leader.FollowerInfo: om1@group-D66704EFC61C->om3: matchIndex:
setUnconditionally 336 -> 336
om1_1 | 2021-02-02 17:17:23,523 [grpc-default-executor-1] INFO
leader.FollowerInfo: om1@group-D66704EFC61C->om3: nextIndex: setUnconditionally
337 -> 337
om1_1 | 2021-02-02 17:17:23,523 [grpc-default-executor-1] INFO
leader.FollowerInfo: om1@group-D66704EFC61C->om3: nextIndex: updateToMax
old=337, new=337, updated? false
om1_1 | 2021-02-02 17:17:23,570 [grpc-default-executor-1] INFO
server.GrpcLogAppender:
om1@group-D66704EFC61C->om2-InstallSnapshotResponseHandler: received a reply
om1<-om2#0:FAIL-t1,ALREADY_INSTALLED,snapshotIndex=336
om1_1 | 2021-02-02 17:17:23,570 [grpc-default-executor-1] INFO
server.GrpcLogAppender:
om1@group-D66704EFC61C->om2-InstallSnapshotResponseHandler: Already Installed
Snapshot Index 336.
{code}
On follower om2:
{code}
om2_1 | 2021-02-02 17:17:23,306 [grpc-default-executor-0] INFO
server.RaftServer$Division: om2@group-D66704EFC61C: receive installSnapshot:
om1->om2#0-t1,notify:(t:1, i:337)
om2_1 | 2021-02-02 17:17:23,312 [grpc-default-executor-0] INFO
server.RaftServer$Division: om2@group-D66704EFC61C: StateMachine snapshotIndex
is 336
om2_1 | 2021-02-02 17:17:23,560 [grpc-default-executor-0] INFO
server.RaftServer$Division: om2@group-D66704EFC61C: set new configuration
configurationEntry {
om2_1 | peers {
om2_1 | id: "om1"
om2_1 | address: "om1:9872"
om2_1 | }
om2_1 | peers {
om2_1 | id: "om3"
om2_1 | address: "om3:9872"
om2_1 | }
om2_1 | peers {
om2_1 | id: "om2"
om2_1 | address: "om2:9872"
om2_1 | }
om2_1 | }
om2_1 | from snapshot
om2_1 | 2021-02-02 17:17:23,561 [grpc-default-executor-0] INFO
server.RaftServer$Division: om2@group-D66704EFC61C: set configuration 0:
[om1|rpc:om1:9872|dataStream:|priority:0,
om3|rpc:om3:9872|dataStream:|priority:0,
om2|rpc:om2:9872|dataStream:|priority:0], old=null
om2_1 | 2021-02-02 17:17:23,567 [grpc-default-executor-0] INFO
server.RaftServer$Division: om2@group-D66704EFC61C: reply installSnapshot:
om1<-om2#0:FAIL-t1,ALREADY_INSTALLED,snapshotIndex=336
om2_1 | 2021-02-02 17:17:23,570 [grpc-default-executor-0] INFO
server.GrpcServerProtocolService: om2: Completed INSTALL_SNAPSHOT, lastRequest:
om1->om2#0-t1,notify:(t:1, i:337)
{code}
These log messages are repeated forever until the cluster is terminated. The
term and index numbers do not change.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)