[
https://issues.apache.org/jira/browse/RATIS-2145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
guangbao zhao updated RATIS-2145:
---------------------------------
Description:
We discovered a problem when writing tests with high concurrency. It often
happens that a follower is running well and then triggers takeSnalshot.
The following is the relevant log.
follower: (as the follower log says, between 2024/08/22 20:18:14,044 and
2024/08/22 20:21:57,058, no other logs appeared in the follower, but the
follower election was not triggered, indicating that the leader gave The
heartbeat sent by the follower is successful)
{code:java}
2024/08/22 20:18:13,987 [node1@group-4F53D3317400-StateMachineUpdater] INFO
org.apache.ratis.server.raftlog.RaftLog:
node1@group-4F53D3317400-SegmentedRaftLog: snapshotIndex: updateIncreasingly
22436696498 -> 22441096501
2024/08/22 20:18:13,999 [node1@group-4F53D3317400-SegmentedRaftLogWorker] INFO
org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker:
node1@group-4F53D3317400-SegmentedRaftLogWorker: created new log segment
/home/work/ssd1/lavafs/aktst-private/metaserver/metadata/ratis/23d5405d-0e30-3d56-9a77-4f53d3317400/current/log_inprogress_22441098615
2024/08/22 20:18:14,044 [node1@group-4F53D3317400-SegmentedRaftLogWorker] INFO
org.apache.ratis.server.raftlog.RaftLog:
node1@group-4F53D3317400-SegmentedRaftLog: purgeIndex: updateToMax
old=22432683959, new=22437078979, updated? true
2024/08/22 20:21:57,058 [grpc-default-executor-23] INFO
com.xxx.RaftJournalManager: Received install snapshot notification from
MetaStore leader: node3 with term index: (t:192, i:22441477801)
2024/08/22 20:21:57,059 [InstallSnapshotThread] INFO
com.xxx.MetaStoreRatisSnapshotProvider: Downloading latest checkpoint from
Leader MetaStore node3. Checkpoint address: leader:8170
2024/08/22 20:21:57,064 [grpc-default-executor-23] INFO
org.apache.ratis.grpc.server.GrpcServerProtocolService: node1: Completed
INSTALL_SNAPSHOT, lastRequest: node3->node1#0-t192,notify:(t:192, i:22441477801)
2024/08/22 20:21:57,065 [grpc-default-executor-23] INFO
org.apache.ratis.grpc.server.GrpcServerProtocolService: node1: Completed
INSTALL_SNAPSHOT, lastReply: null {code}
leader:
{code:java}
2024/08/22 20:18:16,958 [timer5] WARN
org.apache.ratis.grpc.server.GrpcLogAppender:
node3@group-4F53D3317400->node1-GrpcLogAppender: Timed out appendEntries,
errorCount=1,
request=AppendEntriesRequest:cid=16867241,entriesCount=25,entries=(t:192,
i:22441098598)...(t:192, i:22441098622)
2024/08/22 20:18:16,964 [timer3] WARN
org.apache.ratis.grpc.server.GrpcLogAppender:
node3@group-4F53D3317400->node1-GrpcLogAppender: Timed out appendEntries,
errorCount=1,
request=AppendEntriesRequest:cid=16867246,entriesCount=1,entry=(t:192,
i:22441098624)
2024/08/22 20:18:16,964 [timer6] WARN
org.apache.ratis.grpc.server.GrpcLogAppender:
node3@group-4F53D3317400->node1-GrpcLogAppender: Timed out appendEntries,
errorCount=1,
request=AppendEntriesRequest:cid=16867247,entriesCount=1,entry=(t:192,
i:22441098625)
2024/08/22 20:18:16,964 [timer7] WARN
org.apache.ratis.grpc.server.GrpcLogAppender:
node3@group-4F53D3317400->node1-GrpcLogAppender: Timed out appendEntries,
errorCount=1,
request=AppendEntriesRequest:cid=16867245,entriesCount=1,entry=(t:192,
i:22441098623)
2024/08/22 20:18:16,965 [timer3] WARN
org.apache.ratis.grpc.server.GrpcLogAppender:
node3@group-4F53D3317400->node1-GrpcLogAppender: Timed out appendEntries,
errorCount=1,
request=AppendEntriesRequest:cid=16867255,entriesCount=1,entry=(t:192,
i:22441098627)
2024/08/22 20:18:16,965 [timer7] WARN
org.apache.ratis.grpc.server.GrpcLogAppender:
node3@group-4F53D3317400->node1-GrpcLogAppender: Timed out appendEntries,
errorCount=1,
request=AppendEntriesRequest:cid=16867253,entriesCount=1,entry=(t:192,
i:22441098626)
2024/08/22 20:18:16,979 [timer0] WARN
org.apache.ratis.grpc.server.GrpcLogAppender:
node3@group-4F53D3317400->node1-GrpcLogAppender: Timed out appendEntries,
errorCount=1,
request=AppendEntriesRequest:cid=16867406,entriesCount=13,entries=(t:192,
i:22441098628)...(t:192, i:22441098640)
2024/08/22 20:18:16,981 [timer5] WARN
org.apache.ratis.grpc.server.GrpcLogAppender:
node3@group-4F53D3317400->node1-GrpcLogAppender: Timed out appendEntries,
errorCount=1,
request=AppendEntriesRequest:cid=16867411,entriesCount=1,entry=(t:192,
i:22441098641)
2024/08/22 20:21:52,849 [node3@group-4F53D3317400-StateMachineUpdater] INFO
com.xxx.RaftJournalManager: Current LastApplied Index (t:192, i:22445496444)
2024/08/22 20:21:57,021 [node3@group-4F53D3317400-StateMachineUpdater] INFO
org.apache.ratis.server.impl.StateMachineUpdater:
node3@group-4F53D3317400-StateMachineUpdater: Took a snapshot at index
22445496444
2024/08/22 20:21:57,021 [node3@group-4F53D3317400-StateMachineUpdater] INFO
org.apache.ratis.server.impl.StateMachineUpdater:
node3@group-4F53D3317400-StateMachineUpdater: snapshotIndex: updateIncreasingly
22441096438 -> 22445496444
2024/08/22 20:21:57,021 [node3@group-4F53D3317400-StateMachineUpdater] INFO
org.apache.ratis.server.raftlog.RaftLog:
node3@group-4F53D3317400-SegmentedRaftLog: purge 22441496502
2024/08/22 20:21:57,021 [node3@group-4F53D3317400-StateMachineUpdater] INFO
org.apache.ratis.server.raftlog.RaftLog:
node3@group-4F53D3317400-SegmentedRaftLog: snapshotIndex: updateIncreasingly
22441096438 -> 22445496444
2024/08/22 20:21:57,021
[node3@group-4F53D3317400->node1-GrpcLogAppender-LogAppenderDaemon] INFO
org.apache.ratis.grpc.server.GrpcLogAppender:
node3@group-4F53D3317400->node1-GrpcLogAppender: notifyInstallSnapshot with
firstAvailable=(t:192, i:22441477801), followerNextIndex=22441098642 {code}
Because the leader did not receive the onNext callback within the
requestTimeoutDuration(3s) time, the above warn log was
generated.(raft.grpc.server.leader.outstanding.appends.max is set to 8)
{code:java}
scheduler.onTimeout(requestTimeoutDuration,
() -> timeoutAppendRequest(request.getCallId(), request.isHeartbeat()),
LOG, () -> "Timeout check failed for append entry request: " + request);
{code}
and see through code:
{code:java}
private boolean haveTooManyPendingRequests() {
final int size = pendingRequests.logRequestsSize();
if (size == 0) {
return false;
} else if (size >= maxPendingRequestsNum) {
return true;
} else {
// queue is non-empty and non-full
return !replyState.isFirstReplyReceived();
}
} {code}
When the size of pendingRequests.logRequestsSize() is >= 8, log entries will
not be sent, only heartbeats will be sent.(This also verifies that there is a
log gap in the follower log and no election is taking place)
And after the leader meets the purge log conditions, the leader will trigger
notifyInstallSnapshot to the follower. At this time, the nextIndex of the
follower is 22441098642, and the last log of the warn is 22441098641 (exactly
22441098642 - 1), indicating that the follower has received the above warning
logs. It succeeded, but there was no callback to the leader's onNext method to
execute the pendingRequests.remove(reply) method.
> Follower hangs until the next trigger to take a snapshot
> ---------------------------------------------------------
>
> Key: RATIS-2145
> URL: https://issues.apache.org/jira/browse/RATIS-2145
> Project: Ratis
> Issue Type: Bug
> Affects Versions: 3.0.1
> Reporter: guangbao zhao
> Priority: Major
>
> We discovered a problem when writing tests with high concurrency. It often
> happens that a follower is running well and then triggers takeSnalshot.
> The following is the relevant log.
> follower: (as the follower log says, between 2024/08/22 20:18:14,044 and
> 2024/08/22 20:21:57,058, no other logs appeared in the follower, but the
> follower election was not triggered, indicating that the leader gave The
> heartbeat sent by the follower is successful)
>
> {code:java}
> 2024/08/22 20:18:13,987 [node1@group-4F53D3317400-StateMachineUpdater] INFO
> org.apache.ratis.server.raftlog.RaftLog:
> node1@group-4F53D3317400-SegmentedRaftLog: snapshotIndex: updateIncreasingly
> 22436696498 -> 22441096501
> 2024/08/22 20:18:13,999 [node1@group-4F53D3317400-SegmentedRaftLogWorker]
> INFO org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker:
> node1@group-4F53D3317400-SegmentedRaftLogWorker: created new log segment
> /home/work/ssd1/lavafs/aktst-private/metaserver/metadata/ratis/23d5405d-0e30-3d56-9a77-4f53d3317400/current/log_inprogress_22441098615
> 2024/08/22 20:18:14,044 [node1@group-4F53D3317400-SegmentedRaftLogWorker]
> INFO org.apache.ratis.server.raftlog.RaftLog:
> node1@group-4F53D3317400-SegmentedRaftLog: purgeIndex: updateToMax
> old=22432683959, new=22437078979, updated? true
> 2024/08/22 20:21:57,058 [grpc-default-executor-23] INFO
> com.xxx.RaftJournalManager: Received install snapshot notification from
> MetaStore leader: node3 with term index: (t:192, i:22441477801)
> 2024/08/22 20:21:57,059 [InstallSnapshotThread] INFO
> com.xxx.MetaStoreRatisSnapshotProvider: Downloading latest checkpoint from
> Leader MetaStore node3. Checkpoint address: leader:8170
> 2024/08/22 20:21:57,064 [grpc-default-executor-23] INFO
> org.apache.ratis.grpc.server.GrpcServerProtocolService: node1: Completed
> INSTALL_SNAPSHOT, lastRequest: node3->node1#0-t192,notify:(t:192,
> i:22441477801)
> 2024/08/22 20:21:57,065 [grpc-default-executor-23] INFO
> org.apache.ratis.grpc.server.GrpcServerProtocolService: node1: Completed
> INSTALL_SNAPSHOT, lastReply: null {code}
> leader:
>
>
> {code:java}
> 2024/08/22 20:18:16,958 [timer5] WARN
> org.apache.ratis.grpc.server.GrpcLogAppender:
> node3@group-4F53D3317400->node1-GrpcLogAppender: Timed out appendEntries,
> errorCount=1,
> request=AppendEntriesRequest:cid=16867241,entriesCount=25,entries=(t:192,
> i:22441098598)...(t:192, i:22441098622)
> 2024/08/22 20:18:16,964 [timer3] WARN
> org.apache.ratis.grpc.server.GrpcLogAppender:
> node3@group-4F53D3317400->node1-GrpcLogAppender: Timed out appendEntries,
> errorCount=1,
> request=AppendEntriesRequest:cid=16867246,entriesCount=1,entry=(t:192,
> i:22441098624)
> 2024/08/22 20:18:16,964 [timer6] WARN
> org.apache.ratis.grpc.server.GrpcLogAppender:
> node3@group-4F53D3317400->node1-GrpcLogAppender: Timed out appendEntries,
> errorCount=1,
> request=AppendEntriesRequest:cid=16867247,entriesCount=1,entry=(t:192,
> i:22441098625)
> 2024/08/22 20:18:16,964 [timer7] WARN
> org.apache.ratis.grpc.server.GrpcLogAppender:
> node3@group-4F53D3317400->node1-GrpcLogAppender: Timed out appendEntries,
> errorCount=1,
> request=AppendEntriesRequest:cid=16867245,entriesCount=1,entry=(t:192,
> i:22441098623)
> 2024/08/22 20:18:16,965 [timer3] WARN
> org.apache.ratis.grpc.server.GrpcLogAppender:
> node3@group-4F53D3317400->node1-GrpcLogAppender: Timed out appendEntries,
> errorCount=1,
> request=AppendEntriesRequest:cid=16867255,entriesCount=1,entry=(t:192,
> i:22441098627)
> 2024/08/22 20:18:16,965 [timer7] WARN
> org.apache.ratis.grpc.server.GrpcLogAppender:
> node3@group-4F53D3317400->node1-GrpcLogAppender: Timed out appendEntries,
> errorCount=1,
> request=AppendEntriesRequest:cid=16867253,entriesCount=1,entry=(t:192,
> i:22441098626)
> 2024/08/22 20:18:16,979 [timer0] WARN
> org.apache.ratis.grpc.server.GrpcLogAppender:
> node3@group-4F53D3317400->node1-GrpcLogAppender: Timed out appendEntries,
> errorCount=1,
> request=AppendEntriesRequest:cid=16867406,entriesCount=13,entries=(t:192,
> i:22441098628)...(t:192, i:22441098640)
> 2024/08/22 20:18:16,981 [timer5] WARN
> org.apache.ratis.grpc.server.GrpcLogAppender:
> node3@group-4F53D3317400->node1-GrpcLogAppender: Timed out appendEntries,
> errorCount=1,
> request=AppendEntriesRequest:cid=16867411,entriesCount=1,entry=(t:192,
> i:22441098641)
> 2024/08/22 20:21:52,849 [node3@group-4F53D3317400-StateMachineUpdater] INFO
> com.xxx.RaftJournalManager: Current LastApplied Index (t:192, i:22445496444)
> 2024/08/22 20:21:57,021 [node3@group-4F53D3317400-StateMachineUpdater] INFO
> org.apache.ratis.server.impl.StateMachineUpdater:
> node3@group-4F53D3317400-StateMachineUpdater: Took a snapshot at index
> 22445496444
> 2024/08/22 20:21:57,021 [node3@group-4F53D3317400-StateMachineUpdater] INFO
> org.apache.ratis.server.impl.StateMachineUpdater:
> node3@group-4F53D3317400-StateMachineUpdater: snapshotIndex:
> updateIncreasingly 22441096438 -> 22445496444
> 2024/08/22 20:21:57,021 [node3@group-4F53D3317400-StateMachineUpdater] INFO
> org.apache.ratis.server.raftlog.RaftLog:
> node3@group-4F53D3317400-SegmentedRaftLog: purge 22441496502
> 2024/08/22 20:21:57,021 [node3@group-4F53D3317400-StateMachineUpdater] INFO
> org.apache.ratis.server.raftlog.RaftLog:
> node3@group-4F53D3317400-SegmentedRaftLog: snapshotIndex: updateIncreasingly
> 22441096438 -> 22445496444
> 2024/08/22 20:21:57,021
> [node3@group-4F53D3317400->node1-GrpcLogAppender-LogAppenderDaemon] INFO
> org.apache.ratis.grpc.server.GrpcLogAppender:
> node3@group-4F53D3317400->node1-GrpcLogAppender: notifyInstallSnapshot with
> firstAvailable=(t:192, i:22441477801), followerNextIndex=22441098642 {code}
>
> Because the leader did not receive the onNext callback within the
> requestTimeoutDuration(3s) time, the above warn log was
> generated.(raft.grpc.server.leader.outstanding.appends.max is set to 8)
>
> {code:java}
> scheduler.onTimeout(requestTimeoutDuration,
> () -> timeoutAppendRequest(request.getCallId(), request.isHeartbeat()),
> LOG, () -> "Timeout check failed for append entry request: " + request);
> {code}
> and see through code:
>
> {code:java}
> private boolean haveTooManyPendingRequests() {
> final int size = pendingRequests.logRequestsSize();
> if (size == 0) {
> return false;
> } else if (size >= maxPendingRequestsNum) {
> return true;
> } else {
> // queue is non-empty and non-full
> return !replyState.isFirstReplyReceived();
> }
> } {code}
> When the size of pendingRequests.logRequestsSize() is >= 8, log entries will
> not be sent, only heartbeats will be sent.(This also verifies that there is a
> log gap in the follower log and no election is taking place)
> And after the leader meets the purge log conditions, the leader will trigger
> notifyInstallSnapshot to the follower. At this time, the nextIndex of the
> follower is 22441098642, and the last log of the warn is 22441098641 (exactly
> 22441098642 - 1), indicating that the follower has received the above warning
> logs. It succeeded, but there was no callback to the leader's onNext method
> to execute the pendingRequests.remove(reply) method.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)