[
https://issues.apache.org/jira/browse/RATIS-1305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ethan Rose updated RATIS-1305:
------------------------------
Description:
After logs have been purged from the leader and followers, the leader
repeatedly attempts to send snapshots to the followers, who reject them because
there have not been any new transactions to apply. The leader continues to send
snapshots infinitely.
Here is an example of the log messages. om1 is the leader, om2 and om3 are
followers.
On the leader om1:
{code:java}
om1_1 | 2021-02-02 17:17:23,261
[om1@group-D66704EFC61C->om2-GrpcLogAppender-LogAppenderDaemon] INFO
server.GrpcLogAppender: om1@group-D66704EFC61C->om2-GrpcLogAppender:
followerNextIndex = 337 but logStartIndex = -1, notify follower to install
snapshot-(t:1, i:337)
om1_1 | 2021-02-02 17:17:23,272
[om1@group-D66704EFC61C->om3-GrpcLogAppender-LogAppenderDaemon] INFO
server.GrpcLogAppender: om1@group-D66704EFC61C->om3-GrpcLogAppender:
followerNextIndex = 337 but logStartIndex = -1, notify follower to install
snapshot-(t:1, i:337)
om1_1 | 2021-02-02 17:17:23,286
[om1@group-D66704EFC61C->om3-GrpcLogAppender-LogAppenderDaemon] INFO
server.GrpcLogAppender: om1@group-D66704EFC61C->om3-GrpcLogAppender: send
om1->om3#0-t1,notify:(t:1, i:337)
om1_1 | 2021-02-02 17:17:23,286
[om1@group-D66704EFC61C->om2-GrpcLogAppender-LogAppenderDaemon] INFO
server.GrpcLogAppender: om1@group-D66704EFC61C->om2-GrpcLogAppender: send
om1->om2#0-t1,notify:(t:1, i:337)
om1_1 | 2021-02-02 17:17:23,522 [grpc-default-executor-1] INFO
server.GrpcLogAppender:
om1@group-D66704EFC61C->om3-InstallSnapshotResponseHandler: received a reply
om1<-om3#0:FAIL-t1,ALREADY_INSTALLED,snapshotIndex=336
om1_1 | 2021-02-02 17:17:23,522 [grpc-default-executor-1] INFO
server.GrpcLogAppender:
om1@group-D66704EFC61C->om3-InstallSnapshotResponseHandler: Already Installed
Snapshot Index 336.
om1_1 | 2021-02-02 17:17:23,522 [grpc-default-executor-1] INFO
leader.FollowerInfo: om1@group-D66704EFC61C->om3: snapshotIndex:
setUnconditionally 0 -> 336
om1_1 | 2021-02-02 17:17:23,522 [grpc-default-executor-1] INFO
leader.FollowerInfo: om1@group-D66704EFC61C->om3: matchIndex:
setUnconditionally 336 -> 336
om1_1 | 2021-02-02 17:17:23,523 [grpc-default-executor-1] INFO
leader.FollowerInfo: om1@group-D66704EFC61C->om3: nextIndex: setUnconditionally
337 -> 337
om1_1 | 2021-02-02 17:17:23,523 [grpc-default-executor-1] INFO
leader.FollowerInfo: om1@group-D66704EFC61C->om3: nextIndex: updateToMax
old=337, new=337, updated? false
om1_1 | 2021-02-02 17:17:23,570 [grpc-default-executor-1] INFO
server.GrpcLogAppender:
om1@group-D66704EFC61C->om2-InstallSnapshotResponseHandler: received a reply
om1<-om2#0:FAIL-t1,ALREADY_INSTALLED,snapshotIndex=336
om1_1 | 2021-02-02 17:17:23,570 [grpc-default-executor-1] INFO
server.GrpcLogAppender:
om1@group-D66704EFC61C->om2-InstallSnapshotResponseHandler: Already Installed
Snapshot Index 336.
{code}
On follower om2:
{code:java}
om2_1 | 2021-02-02 17:17:23,306 [grpc-default-executor-0] INFO
server.RaftServer$Division: om2@group-D66704EFC61C: receive installSnapshot:
om1->om2#0-t1,notify:(t:1, i:337)
om2_1 | 2021-02-02 17:17:23,312 [grpc-default-executor-0] INFO
server.RaftServer$Division: om2@group-D66704EFC61C: StateMachine snapshotIndex
is 336
om2_1 | 2021-02-02 17:17:23,560 [grpc-default-executor-0] INFO
server.RaftServer$Division: om2@group-D66704EFC61C: set new configuration
configurationEntry {
om2_1 | peers {
om2_1 | id: "om1"
om2_1 | address: "om1:9872"
om2_1 | }
om2_1 | peers {
om2_1 | id: "om3"
om2_1 | address: "om3:9872"
om2_1 | }
om2_1 | peers {
om2_1 | id: "om2"
om2_1 | address: "om2:9872"
om2_1 | }
om2_1 | }
om2_1 | from snapshot
om2_1 | 2021-02-02 17:17:23,561 [grpc-default-executor-0] INFO
server.RaftServer$Division: om2@group-D66704EFC61C: set configuration 0:
[om1|rpc:om1:9872|dataStream:|priority:0,
om3|rpc:om3:9872|dataStream:|priority:0,
om2|rpc:om2:9872|dataStream:|priority:0], old=null
om2_1 | 2021-02-02 17:17:23,567 [grpc-default-executor-0] INFO
server.RaftServer$Division: om2@group-D66704EFC61C: reply installSnapshot:
om1<-om2#0:FAIL-t1,ALREADY_INSTALLED,snapshotIndex=336
om2_1 | 2021-02-02 17:17:23,570 [grpc-default-executor-0] INFO
server.GrpcServerProtocolService: om2: Completed INSTALL_SNAPSHOT, lastRequest:
om1->om2#0-t1,notify:(t:1, i:337)
{code}
These log messages are repeated forever until the cluster is terminated. The
term and index numbers do not change.
EDIT: I think the cluster not responding was caused by something different.
This behavior is still a bug, however.
was:
After logs have been purged from the leader and followers, the leader
repeatedly attempts to send snapshots to the followers, who reject them because
there have not been any new transactions to apply. The leader continues to send
snapshots infinitely.
Here is an example of the log messages. om1 is the leader, om2 and om3 are
followers.
On the leader om1:
{code:java}
om1_1 | 2021-02-02 17:17:23,261
[om1@group-D66704EFC61C->om2-GrpcLogAppender-LogAppenderDaemon] INFO
server.GrpcLogAppender: om1@group-D66704EFC61C->om2-GrpcLogAppender:
followerNextIndex = 337 but logStartIndex = -1, notify follower to install
snapshot-(t:1, i:337)
om1_1 | 2021-02-02 17:17:23,272
[om1@group-D66704EFC61C->om3-GrpcLogAppender-LogAppenderDaemon] INFO
server.GrpcLogAppender: om1@group-D66704EFC61C->om3-GrpcLogAppender:
followerNextIndex = 337 but logStartIndex = -1, notify follower to install
snapshot-(t:1, i:337)
om1_1 | 2021-02-02 17:17:23,286
[om1@group-D66704EFC61C->om3-GrpcLogAppender-LogAppenderDaemon] INFO
server.GrpcLogAppender: om1@group-D66704EFC61C->om3-GrpcLogAppender: send
om1->om3#0-t1,notify:(t:1, i:337)
om1_1 | 2021-02-02 17:17:23,286
[om1@group-D66704EFC61C->om2-GrpcLogAppender-LogAppenderDaemon] INFO
server.GrpcLogAppender: om1@group-D66704EFC61C->om2-GrpcLogAppender: send
om1->om2#0-t1,notify:(t:1, i:337)
om1_1 | 2021-02-02 17:17:23,522 [grpc-default-executor-1] INFO
server.GrpcLogAppender:
om1@group-D66704EFC61C->om3-InstallSnapshotResponseHandler: received a reply
om1<-om3#0:FAIL-t1,ALREADY_INSTALLED,snapshotIndex=336
om1_1 | 2021-02-02 17:17:23,522 [grpc-default-executor-1] INFO
server.GrpcLogAppender:
om1@group-D66704EFC61C->om3-InstallSnapshotResponseHandler: Already Installed
Snapshot Index 336.
om1_1 | 2021-02-02 17:17:23,522 [grpc-default-executor-1] INFO
leader.FollowerInfo: om1@group-D66704EFC61C->om3: snapshotIndex:
setUnconditionally 0 -> 336
om1_1 | 2021-02-02 17:17:23,522 [grpc-default-executor-1] INFO
leader.FollowerInfo: om1@group-D66704EFC61C->om3: matchIndex:
setUnconditionally 336 -> 336
om1_1 | 2021-02-02 17:17:23,523 [grpc-default-executor-1] INFO
leader.FollowerInfo: om1@group-D66704EFC61C->om3: nextIndex: setUnconditionally
337 -> 337
om1_1 | 2021-02-02 17:17:23,523 [grpc-default-executor-1] INFO
leader.FollowerInfo: om1@group-D66704EFC61C->om3: nextIndex: updateToMax
old=337, new=337, updated? false
om1_1 | 2021-02-02 17:17:23,570 [grpc-default-executor-1] INFO
server.GrpcLogAppender:
om1@group-D66704EFC61C->om2-InstallSnapshotResponseHandler: received a reply
om1<-om2#0:FAIL-t1,ALREADY_INSTALLED,snapshotIndex=336
om1_1 | 2021-02-02 17:17:23,570 [grpc-default-executor-1] INFO
server.GrpcLogAppender:
om1@group-D66704EFC61C->om2-InstallSnapshotResponseHandler: Already Installed
Snapshot Index 336.
{code}
On follower om2:
{code:java}
om2_1 | 2021-02-02 17:17:23,306 [grpc-default-executor-0] INFO
server.RaftServer$Division: om2@group-D66704EFC61C: receive installSnapshot:
om1->om2#0-t1,notify:(t:1, i:337)
om2_1 | 2021-02-02 17:17:23,312 [grpc-default-executor-0] INFO
server.RaftServer$Division: om2@group-D66704EFC61C: StateMachine snapshotIndex
is 336
om2_1 | 2021-02-02 17:17:23,560 [grpc-default-executor-0] INFO
server.RaftServer$Division: om2@group-D66704EFC61C: set new configuration
configurationEntry {
om2_1 | peers {
om2_1 | id: "om1"
om2_1 | address: "om1:9872"
om2_1 | }
om2_1 | peers {
om2_1 | id: "om3"
om2_1 | address: "om3:9872"
om2_1 | }
om2_1 | peers {
om2_1 | id: "om2"
om2_1 | address: "om2:9872"
om2_1 | }
om2_1 | }
om2_1 | from snapshot
om2_1 | 2021-02-02 17:17:23,561 [grpc-default-executor-0] INFO
server.RaftServer$Division: om2@group-D66704EFC61C: set configuration 0:
[om1|rpc:om1:9872|dataStream:|priority:0,
om3|rpc:om3:9872|dataStream:|priority:0,
om2|rpc:om2:9872|dataStream:|priority:0], old=null
om2_1 | 2021-02-02 17:17:23,567 [grpc-default-executor-0] INFO
server.RaftServer$Division: om2@group-D66704EFC61C: reply installSnapshot:
om1<-om2#0:FAIL-t1,ALREADY_INSTALLED,snapshotIndex=336
om2_1 | 2021-02-02 17:17:23,570 [grpc-default-executor-0] INFO
server.GrpcServerProtocolService: om2: Completed INSTALL_SNAPSHOT, lastRequest:
om1->om2#0-t1,notify:(t:1, i:337)
{code}
These log messages are repeated forever until the cluster is terminated. The
term and index numbers do not change.
> Leader stuck in infinite install snapshot cycle when logs have been purged
> --------------------------------------------------------------------------
>
> Key: RATIS-1305
> URL: https://issues.apache.org/jira/browse/RATIS-1305
> Project: Ratis
> Issue Type: Bug
> Components: server
> Reporter: Ethan Rose
> Assignee: Ethan Rose
> Priority: Blocker
> Fix For: 1.1.0
>
>
> After logs have been purged from the leader and followers, the leader
> repeatedly attempts to send snapshots to the followers, who reject them
> because there have not been any new transactions to apply. The leader
> continues to send snapshots infinitely.
> Here is an example of the log messages. om1 is the leader, om2 and om3 are
> followers.
> On the leader om1:
> {code:java}
> om1_1 | 2021-02-02 17:17:23,261
> [om1@group-D66704EFC61C->om2-GrpcLogAppender-LogAppenderDaemon] INFO
> server.GrpcLogAppender: om1@group-D66704EFC61C->om2-GrpcLogAppender:
> followerNextIndex = 337 but logStartIndex = -1, notify follower to install
> snapshot-(t:1, i:337)
> om1_1 | 2021-02-02 17:17:23,272
> [om1@group-D66704EFC61C->om3-GrpcLogAppender-LogAppenderDaemon] INFO
> server.GrpcLogAppender: om1@group-D66704EFC61C->om3-GrpcLogAppender:
> followerNextIndex = 337 but logStartIndex = -1, notify follower to install
> snapshot-(t:1, i:337)
> om1_1 | 2021-02-02 17:17:23,286
> [om1@group-D66704EFC61C->om3-GrpcLogAppender-LogAppenderDaemon] INFO
> server.GrpcLogAppender: om1@group-D66704EFC61C->om3-GrpcLogAppender: send
> om1->om3#0-t1,notify:(t:1, i:337)
> om1_1 | 2021-02-02 17:17:23,286
> [om1@group-D66704EFC61C->om2-GrpcLogAppender-LogAppenderDaemon] INFO
> server.GrpcLogAppender: om1@group-D66704EFC61C->om2-GrpcLogAppender: send
> om1->om2#0-t1,notify:(t:1, i:337)
> om1_1 | 2021-02-02 17:17:23,522 [grpc-default-executor-1] INFO
> server.GrpcLogAppender:
> om1@group-D66704EFC61C->om3-InstallSnapshotResponseHandler: received a reply
> om1<-om3#0:FAIL-t1,ALREADY_INSTALLED,snapshotIndex=336
> om1_1 | 2021-02-02 17:17:23,522 [grpc-default-executor-1] INFO
> server.GrpcLogAppender:
> om1@group-D66704EFC61C->om3-InstallSnapshotResponseHandler: Already Installed
> Snapshot Index 336.
> om1_1 | 2021-02-02 17:17:23,522 [grpc-default-executor-1] INFO
> leader.FollowerInfo: om1@group-D66704EFC61C->om3: snapshotIndex:
> setUnconditionally 0 -> 336
> om1_1 | 2021-02-02 17:17:23,522 [grpc-default-executor-1] INFO
> leader.FollowerInfo: om1@group-D66704EFC61C->om3: matchIndex:
> setUnconditionally 336 -> 336
> om1_1 | 2021-02-02 17:17:23,523 [grpc-default-executor-1] INFO
> leader.FollowerInfo: om1@group-D66704EFC61C->om3: nextIndex:
> setUnconditionally 337 -> 337
> om1_1 | 2021-02-02 17:17:23,523 [grpc-default-executor-1] INFO
> leader.FollowerInfo: om1@group-D66704EFC61C->om3: nextIndex: updateToMax
> old=337, new=337, updated? false
> om1_1 | 2021-02-02 17:17:23,570 [grpc-default-executor-1] INFO
> server.GrpcLogAppender:
> om1@group-D66704EFC61C->om2-InstallSnapshotResponseHandler: received a reply
> om1<-om2#0:FAIL-t1,ALREADY_INSTALLED,snapshotIndex=336
> om1_1 | 2021-02-02 17:17:23,570 [grpc-default-executor-1] INFO
> server.GrpcLogAppender:
> om1@group-D66704EFC61C->om2-InstallSnapshotResponseHandler: Already Installed
> Snapshot Index 336.
> {code}
>
> On follower om2:
> {code:java}
> om2_1 | 2021-02-02 17:17:23,306 [grpc-default-executor-0] INFO
> server.RaftServer$Division: om2@group-D66704EFC61C: receive installSnapshot:
> om1->om2#0-t1,notify:(t:1, i:337)
> om2_1 | 2021-02-02 17:17:23,312 [grpc-default-executor-0] INFO
> server.RaftServer$Division: om2@group-D66704EFC61C: StateMachine
> snapshotIndex is 336
> om2_1 | 2021-02-02 17:17:23,560 [grpc-default-executor-0] INFO
> server.RaftServer$Division: om2@group-D66704EFC61C: set new configuration
> configurationEntry {
> om2_1 | peers {
> om2_1 | id: "om1"
> om2_1 | address: "om1:9872"
> om2_1 | }
> om2_1 | peers {
> om2_1 | id: "om3"
> om2_1 | address: "om3:9872"
> om2_1 | }
> om2_1 | peers {
> om2_1 | id: "om2"
> om2_1 | address: "om2:9872"
> om2_1 | }
> om2_1 | }
> om2_1 | from snapshot
> om2_1 | 2021-02-02 17:17:23,561 [grpc-default-executor-0] INFO
> server.RaftServer$Division: om2@group-D66704EFC61C: set configuration 0:
> [om1|rpc:om1:9872|dataStream:|priority:0,
> om3|rpc:om3:9872|dataStream:|priority:0,
> om2|rpc:om2:9872|dataStream:|priority:0], old=null
> om2_1 | 2021-02-02 17:17:23,567 [grpc-default-executor-0] INFO
> server.RaftServer$Division: om2@group-D66704EFC61C: reply installSnapshot:
> om1<-om2#0:FAIL-t1,ALREADY_INSTALLED,snapshotIndex=336
> om2_1 | 2021-02-02 17:17:23,570 [grpc-default-executor-0] INFO
> server.GrpcServerProtocolService: om2: Completed INSTALL_SNAPSHOT,
> lastRequest: om1->om2#0-t1,notify:(t:1, i:337)
> {code}
>
> These log messages are repeated forever until the cluster is terminated. The
> term and index numbers do not change.
>
> EDIT: I think the cluster not responding was caused by something different.
> This behavior is still a bug, however.
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)