[
https://issues.apache.org/jira/browse/RATIS-2156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17881897#comment-17881897
]
Ivan Andika commented on RATIS-2156:
------------------------------------
Looking at the logs, the DN encountered a lot of the gRPC calls are cancelled
Leader
{code:java}
2024-09-13 11:20:25,876 [grpc-default-executor-32790] WARN
org.apache.ratis.grpc.server.GrpcLogAppender:
4a23d1e8-d526-4a4d-8ed1-13ffbab3a5cc@group-38F8784D0C1B->2d711272-c6fb-449c-8759-f42fbbfb0607-AppendLogResponseHandler:
Failed appendEntries:
org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: CANCELLED:
RST_STREAM closed stream. HTTP/2 error code: CANCEL
2024-09-13 11:20:25,898 [grpc-default-executor-32790] INFO
org.apache.ratis.server.leader.FollowerInfo:
4a23d1e8-d526-4a4d-8ed1-13ffbab3a5cc@group-38F8784D0C1B->2d711272-c6fb-449c-8759-f42fbbfb0607:
nextIndex: updateUnconditionally 275466714 -> 275465958
2024-09-13 11:20:26,004 [grpc-default-executor-32783] INFO
org.apache.ratis.server.leader.FollowerInfo:
4a23d1e8-d526-4a4d-8ed1-13ffbab3a5cc@group-38F8784D0C1B->2d711272-c6fb-449c-8759-f42fbbfb0607:
nextIndex: updateUnconditionally 275466057 -> 275465974
2024-09-13 11:20:26,004 [grpc-default-executor-32783] WARN
org.apache.ratis.grpc.server.GrpcLogAppender:
4a23d1e8-d526-4a4d-8ed1-13ffbab3a5cc@group-38F8784D0C1B->2d711272-c6fb-449c-8759-f42fbbfb0607-AppendLogResponseHandler:
Failed appendEntries:
org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: CANCELLED:
RST_STREAM closed stream. HTTP/2 error code: CANCEL
2024-09-13 11:20:26,004 [grpc-default-executor-32783] INFO
org.apache.ratis.server.leader.FollowerInfo:
4a23d1e8-d526-4a4d-8ed1-13ffbab3a5cc@group-38F8784D0C1B->2d711272-c6fb-449c-8759-f42fbbfb0607:
nextIndex: updateUnconditionally 275465974 -> 275465958
2024-09-13 11:20:26,106 [grpc-default-executor-32783] WARN
org.apache.ratis.grpc.server.GrpcLogAppender:
4a23d1e8-d526-4a4d-8ed1-13ffbab3a5cc@group-38F8784D0C1B->2d711272-c6fb-449c-8759-f42fbbfb0607-AppendLogResponseHandler:
Failed appendEntries:
org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: CANCELLED:
RST_STREAM closed stream. HTTP/2 error code: CANCEL
2024-09-13 11:20:26,110 [grpc-default-executor-32783] INFO
org.apache.ratis.server.leader.FollowerInfo:
4a23d1e8-d526-4a4d-8ed1-13ffbab3a5cc@group-38F8784D0C1B->2d711272-c6fb-449c-8759-f42fbbfb0607:
nextIndex: updateUnconditionally 275466037 -> 275465958
2024-09-13 11:20:26,206 [grpc-default-executor-32790] INFO
org.apache.ratis.server.leader.FollowerInfo:
4a23d1e8-d526-4a4d-8ed1-13ffbab3a5cc@group-38F8784D0C1B->2d711272-c6fb-449c-8759-f42fbbfb0607:
nextIndex: updateUnconditionally 275466037 -> 275465974
2024-09-13 11:20:26,206 [grpc-default-executor-32790] WARN
org.apache.ratis.grpc.server.GrpcLogAppender:
4a23d1e8-d526-4a4d-8ed1-13ffbab3a5cc@group-38F8784D0C1B->2d711272-c6fb-449c-8759-f42fbbfb0607-AppendLogResponseHandler:
Failed appendEntries:
org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: CANCELLED:
RST_STREAM closed stream. HTTP/2 error code: CANCEL
2024-09-13 11:20:26,213 [grpc-default-executor-32790] INFO
org.apache.ratis.server.leader.FollowerInfo:
4a23d1e8-d526-4a4d-8ed1-13ffbab3a5cc@group-38F8784D0C1B->2d711272-c6fb-449c-8759-f42fbbfb0607:
nextIndex: updateUnconditionally 275465992 -> 275465958
2024-09-13 11:20:26,311 [grpc-default-executor-32790] WARN
org.apache.ratis.grpc.server.GrpcLogAppender:
4a23d1e8-d526-4a4d-8ed1-13ffbab3a5cc@group-38F8784D0C1B->2d711272-c6fb-449c-8759-f42fbbfb0607-AppendLogResponseHandler:
Failed appendEntries:
org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: CANCELLED:
RST_STREAM closed stream. HTTP/2 error code: CANCEL
2024-09-13 11:20:26,314 [grpc-default-executor-32790] INFO
org.apache.ratis.server.leader.FollowerInfo:
4a23d1e8-d526-4a4d-8ed1-13ffbab3a5cc@group-38F8784D0C1B->2d711272-c6fb-449c-8759-f42fbbfb0607:
nextIndex: updateUnconditionally 275466057 -> 275465958
2024-09-13 11:20:26,439 [grpc-default-executor-32783] INFO
org.apache.ratis.server.leader.FollowerInfo:
4a23d1e8-d526-4a4d-8ed1-13ffbab3a5cc@group-38F8784D0C1B->2d711272-c6fb-449c-8759-f42fbbfb0607:
nextIndex: updateUnconditionally 275466090 -> 275465974
2024-09-13 11:20:26,446 [grpc-default-executor-32783] WARN
org.apache.ratis.grpc.server.GrpcLogAppender:
4a23d1e8-d526-4a4d-8ed1-13ffbab3a5cc@group-38F8784D0C1B->2d711272-c6fb-449c-8759-f42fbbfb0607-AppendLogResponseHandler:
Failed appendEntries:
org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: CANCELLED:
RST_STREAM closed stream. HTTP/2 error code: CANCEL
2024-09-13 11:20:26,446 [grpc-default-executor-32783] INFO
org.apache.ratis.server.leader.FollowerInfo:
4a23d1e8-d526-4a4d-8ed1-13ffbab3a5cc@group-38F8784D0C1B->2d711272-c6fb-449c-8759-f42fbbfb0607:
nextIndex: updateUnconditionally 275465974 -> 275465958
2024-09-13 11:20:26,519 [grpc-default-executor-32773] INFO
org.apache.ratis.server.leader.FollowerInfo:
4a23d1e8-d526-4a4d-8ed1-13ffbab3a5cc@group-38F8784D0C1B->2d711272-c6fb-449c-8759-f42fbbfb0607:
nextIndex: updateUnconditionally 275466037 -> 275465974
2024-09-13 11:20:26,520 [grpc-default-executor-32773] WARN
org.apache.ratis.grpc.server.GrpcLogAppender:
4a23d1e8-d526-4a4d-8ed1-13ffbab3a5cc@group-38F8784D0C1B->2d711272-c6fb-449c-8759-f42fbbfb0607-AppendLogResponseHandler:
Failed appendEntries:
org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: CANCELLED:
RST_STREAM closed stream. HTTP/2 error code: CANCEL{code}
Follower
{code:java}
2024-09-13 11:20:25,837 [grpc-default-executor-5633] WARN
org.apache.ratis.grpc.server.GrpcServerProtocolService:
2d711272-c6fb-449c-8759-f42fbbfb0607: installSnapshot onError, lastRequest:
4a23d1e8-d526-4a4d-8ed1-13ffbab3a5cc->2d711272-c6fb-449c-8759-f42fbbfb0607#635249576-t5,previous=(t:5,
i:275465957),leaderCommit=275465973,initializing? true,entries: size=16,
first=(t:5, i:275465958), STATEMACHINELOGENTRY, 42288156@client-AE31E320960E:
org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: CANCELLED: client
cancelled
2024-09-13 11:20:25,992
[2d711272-c6fb-449c-8759-f42fbbfb0607-server-thread40036] INFO
org.apache.ratis.server.RaftServer$Division:
2d711272-c6fb-449c-8759-f42fbbfb0607@group-38F8784D0C1B: Failed appendEntries
as the first entry (index 275465958) already exists (snapshotIndex: -1,
commitIndex: 275465973)
2024-09-13 11:20:25,992
[2d711272-c6fb-449c-8759-f42fbbfb0607-server-thread40036] INFO
org.apache.ratis.server.RaftServer$Division:
2d711272-c6fb-449c-8759-f42fbbfb0607@group-38F8784D0C1B: inconsistency entries.
Reply:4a23d1e8-d526-4a4d-8ed1-13ffbab3a5cc<-2d711272-c6fb-449c-8759-f42fbbfb0607#635257589:FAIL-t5,INCONSISTENCY,nextIndex=275465974,followerCommit=275465973,matchIndex=-1
2024-09-13 11:20:25,994 [grpc-default-executor-5632] WARN
org.apache.ratis.grpc.server.GrpcServerProtocolService:
2d711272-c6fb-449c-8759-f42fbbfb0607: installSnapshot onError, lastRequest:
4a23d1e8-d526-4a4d-8ed1-13ffbab3a5cc->2d711272-c6fb-449c-8759-f42fbbfb0607#635257589-t5,previous=(t:5,
i:275465957),leaderCommit=275465973,initializing? true,entries: size=16,
first=(t:5, i:275465958), STATEMACHINELOGENTRY, 42288156@client-AE31E320960E:
org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: CANCELLED: client
cancelled
2024-09-13 11:20:26,103
[2d711272-c6fb-449c-8759-f42fbbfb0607-server-thread40036] INFO
org.apache.ratis.server.RaftServer$Division:
2d711272-c6fb-449c-8759-f42fbbfb0607@group-38F8784D0C1B: Failed appendEntries
as the first entry (index 275465958) already exists (snapshotIndex: -1,
commitIndex: 275465973)
2024-09-13 11:20:26,103
[2d711272-c6fb-449c-8759-f42fbbfb0607-server-thread40036] INFO
org.apache.ratis.server.RaftServer$Division:
2d711272-c6fb-449c-8759-f42fbbfb0607@group-38F8784D0C1B: inconsistency entries.
Reply:4a23d1e8-d526-4a4d-8ed1-13ffbab3a5cc<-2d711272-c6fb-449c-8759-f42fbbfb0607#635257594:FAIL-t5,INCONSISTENCY,nextIndex=275465974,followerCommit=275465973,matchIndex=-1
2024-09-13 11:20:26,109 [grpc-default-executor-5638] WARN
org.apache.ratis.grpc.server.GrpcServerProtocolService:
2d711272-c6fb-449c-8759-f42fbbfb0607: installSnapshot onError, lastRequest:
4a23d1e8-d526-4a4d-8ed1-13ffbab3a5cc->2d711272-c6fb-449c-8759-f42fbbfb0607#635257594-t5,previous=(t:5,
i:275465957),leaderCommit=275465973,initializing? true,entries: size=16,
first=(t:5, i:275465958), STATEMACHINELOGENTRY, 42288156@client-AE31E320960E:
org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: CANCELLED: client
cancelled
2024-09-13 11:20:26,205
[2d711272-c6fb-449c-8759-f42fbbfb0607-server-thread40059] INFO
org.apache.ratis.server.RaftServer$Division:
2d711272-c6fb-449c-8759-f42fbbfb0607@group-38F8784D0C1B: Failed appendEntries
as the first entry (index 275465958) already exists (snapshotIndex: -1,
commitIndex: 275465973)
2024-09-13 11:20:26,205
[2d711272-c6fb-449c-8759-f42fbbfb0607-server-thread40059] INFO
org.apache.ratis.server.RaftServer$Division:
2d711272-c6fb-449c-8759-f42fbbfb0607@group-38F8784D0C1B: inconsistency entries.
Reply:4a23d1e8-d526-4a4d-8ed1-13ffbab3a5cc<-2d711272-c6fb-449c-8759-f42fbbfb0607#635257600:FAIL-t5,INCONSISTENCY,nextIndex=275465974,followerCommit=275465973,matchIndex=-1
2024-09-13 11:20:26,206 [grpc-default-executor-5633] WARN
org.apache.ratis.grpc.server.GrpcServerProtocolService:
2d711272-c6fb-449c-8759-f42fbbfb0607: installSnapshot onError, lastRequest:
4a23d1e8-d526-4a4d-8ed1-13ffbab3a5cc->2d711272-c6fb-449c-8759-f42fbbfb0607#635257600-t5,previous=(t:5,
i:275465957),leaderCommit=275465973,initializing? true,entries: size=16,
first=(t:5, i:275465958), STATEMACHINELOGENTRY, 42288156@client-AE31E320960E:
org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: CANCELLED: client
cancelled
2024-09-13 11:20:26,310
[2d711272-c6fb-449c-8759-f42fbbfb0607-server-thread40049] INFO
org.apache.ratis.server.RaftServer$Division:
2d711272-c6fb-449c-8759-f42fbbfb0607@group-38F8784D0C1B: Failed appendEntries
as the first entry (index 275465958) already exists (snapshotIndex: -1,
commitIndex: 275465973)
2024-09-13 11:20:26,311 [grpc-default-executor-5638] WARN
org.apache.ratis.grpc.server.GrpcServerProtocolService:
2d711272-c6fb-449c-8759-f42fbbfb0607: installSnapshot onError, lastRequest:
4a23d1e8-d526-4a4d-8ed1-13ffbab3a5cc->2d711272-c6fb-449c-8759-f42fbbfb0607#635257612-t5,previous=(t:5,
i:275465957),leaderCommit=275465973,initializing? true,entries: size=16,
first=(t:5, i:275465958), STATEMACHINELOGENTRY, 42288156@client-AE31E320960E:
org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: CANCELLED: client
cancelled
2024-09-13 11:20:26,439
[2d711272-c6fb-449c-8759-f42fbbfb0607-server-thread40038] INFO
org.apache.ratis.server.RaftServer$Division:
2d711272-c6fb-449c-8759-f42fbbfb0607@group-38F8784D0C1B: Failed appendEntries
as the first entry (index 275465958) already exists (snapshotIndex: -1,
commitIndex: 275465973)
2024-09-13 11:20:26,439
[2d711272-c6fb-449c-8759-f42fbbfb0607-server-thread40038] INFO
org.apache.ratis.server.RaftServer$Division:
2d711272-c6fb-449c-8759-f42fbbfb0607@group-38F8784D0C1B: inconsistency entries.
Reply:4a23d1e8-d526-4a4d-8ed1-13ffbab3a5cc<-2d711272-c6fb-449c-8759-f42fbbfb0607#635257619:FAIL-t5,INCONSISTENCY,nextIndex=275465974,followerCommit=275465973,matchIndex=-1
2024-09-13 11:20:26,440 [grpc-default-executor-5633] WARN
org.apache.ratis.grpc.server.GrpcServerProtocolService:
2d711272-c6fb-449c-8759-f42fbbfb0607: installSnapshot onError, lastRequest:
4a23d1e8-d526-4a4d-8ed1-13ffbab3a5cc->2d711272-c6fb-449c-8759-f42fbbfb0607#635257619-t5,previous=(t:5,
i:275465957),leaderCommit=275465973,initializing? true,entries: size=16,
first=(t:5, i:275465958), STATEMACHINELOGENTRY, 42288156@client-AE31E320960E:
org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: CANCELLED: client
cancelled
2024-09-13 11:20:26,519
[2d711272-c6fb-449c-8759-f42fbbfb0607-server-thread40059] INFO
org.apache.ratis.server.RaftServer$Division:
2d711272-c6fb-449c-8759-f42fbbfb0607@group-38F8784D0C1B: Failed appendEntries
as the first entry (index 275465958) already exists (snapshotIndex: -1,
commitIndex: 275465973)
2024-09-13 11:20:26,519
[2d711272-c6fb-449c-8759-f42fbbfb0607-server-thread40059] INFO
org.apache.ratis.server.RaftServer$Division:
2d711272-c6fb-449c-8759-f42fbbfb0607@group-38F8784D0C1B: inconsistency entries.
Reply:4a23d1e8-d526-4a4d-8ed1-13ffbab3a5cc<-2d711272-c6fb-449c-8759-f42fbbfb0607#635257630:FAIL-t5,INCONSISTENCY,nextIndex=275465974,followerCommit=275465973,matchIndex=-1
2024-09-13 11:20:26,520 [grpc-default-executor-5633] WARN
org.apache.ratis.grpc.server.GrpcServerProtocolService:
2d711272-c6fb-449c-8759-f42fbbfb0607: installSnapshot onError, lastRequest:
4a23d1e8-d526-4a4d-8ed1-13ffbab3a5cc->2d711272-c6fb-449c-8759-f42fbbfb0607#635257630-t5,previous=(t:5,
i:275465957),leaderCommit=275465973,initializing? true,entries: size=16,
first=(t:5, i:275465958), STATEMACHINELOGENTRY, 42288156@client-AE31E320960E:
org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: CANCELLED: client
cancelled
2024-09-13 11:20:26,611
[2d711272-c6fb-449c-8759-f42fbbfb0607-server-thread40059] INFO
org.apache.ratis.server.RaftServer$Division:
2d711272-c6fb-449c-8759-f42fbbfb0607@group-38F8784D0C1B: Failed appendEntries
as the first entry (index 275465958) already exists (snapshotIndex: -1,
commitIndex: 275465973)
2024-09-13 11:20:26,611
[2d711272-c6fb-449c-8759-f42fbbfb0607-server-thread40059] INFO
org.apache.ratis.server.RaftServer$Division:
2d711272-c6fb-449c-8759-f42fbbfb0607@group-38F8784D0C1B: inconsistency entries.
Reply:4a23d1e8-d526-4a4d-8ed1-13ffbab3a5cc<-2d711272-c6fb-449c-8759-f42fbbfb0607#635257636:FAIL-t5,INCONSISTENCY,nextIndex=275465974,followerCommit=275465973,matchIndex=-1
2024-09-13 11:20:26,614 [grpc-default-executor-5632] WARN
org.apache.ratis.grpc.server.GrpcServerProtocolService:
2d711272-c6fb-449c-8759-f42fbbfb0607: installSnapshot onError, lastRequest:
4a23d1e8-d526-4a4d-8ed1-13ffbab3a5cc->2d711272-c6fb-449c-8759-f42fbbfb0607#635257636-t5,previous=(t:5,
i:275465957),leaderCommit=275465973,initializing? true,entries: size=16,
first=(t:5, i:275465958), STATEMACHINELOGENTRY, 42288156@client-AE31E320960E:
org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: CANCELLED: client
cancelled
2024-09-13 11:20:26,697
[2d711272-c6fb-449c-8759-f42fbbfb0607-server-thread40049] INFO
org.apache.ratis.server.RaftServer$Division:
2d711272-c6fb-449c-8759-f42fbbfb0607@group-38F8784D0C1B: Failed appendEntries
as the first entry (index 275465958) already exists (snapshotIndex: -1,
commitIndex: 275465973)
2024-09-13 11:20:26,698
[2d711272-c6fb-449c-8759-f42fbbfb0607-server-thread40049] INFO
org.apache.ratis.server.RaftServer$Division:
2d711272-c6fb-449c-8759-f42fbbfb0607@group-38F8784D0C1B: inconsistency entries.
Reply:4a23d1e8-d526-4a4d-8ed1-13ffbab3a5cc<-2d711272-c6fb-449c-8759-f42fbbfb0607#635257643:FAIL-t5,INCONSISTENCY,nextIndex=275465974,followerCommit=275465973,matchIndex=-1
2024-09-13 11:20:26,698 [grpc-default-executor-5638] WARN
org.apache.ratis.grpc.server.GrpcServerProtocolService:
2d711272-c6fb-449c-8759-f42fbbfb0607: installSnapshot onError, lastRequest:
4a23d1e8-d526-4a4d-8ed1-13ffbab3a5cc->2d711272-c6fb-449c-8759-f42fbbfb0607#635257643-t5,previous=(t:5,
i:275465957),leaderCommit=275465973,initializing? true,entries: size=16,
first=(t:5, i:275465958), STATEMACHINELOGENTRY, 42288156@client-AE31E320960E:
org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: CANCELLED: client
cancelled
2024-09-13 11:20:26,805
[2d711272-c6fb-449c-8759-f42fbbfb0607-server-thread40059] INFO
org.apache.ratis.server.RaftServer$Division:
2d711272-c6fb-449c-8759-f42fbbfb0607@group-38F8784D0C1B: Failed appendEntries
as the first entry (index 275465958) already exists (snapshotIndex: -1,
commitIndex: 275465973)
2024-09-13 11:20:26,812
[2d711272-c6fb-449c-8759-f42fbbfb0607-server-thread40059] INFO
org.apache.ratis.server.RaftServer$Division:
2d711272-c6fb-449c-8759-f42fbbfb0607@group-38F8784D0C1B: inconsistency entries.
Reply:4a23d1e8-d526-4a4d-8ed1-13ffbab3a5cc<-2d711272-c6fb-449c-8759-f42fbbfb0607#635257649:FAIL-t5,INCONSISTENCY,nextIndex=275465974,followerCommit=275465973,matchIndex=-1
2024-09-13 11:20:26,813 [grpc-default-executor-5633] WARN
org.apache.ratis.grpc.server.GrpcServerProtocolService:
2d711272-c6fb-449c-8759-f42fbbfb0607: installSnapshot onError, lastRequest:
4a23d1e8-d526-4a4d-8ed1-13ffbab3a5cc->2d711272-c6fb-449c-8759-f42fbbfb0607#635257649-t5,previous=(t:5,
i:275465957),leaderCommit=275465973,initializing? true,entries: size=16,
first=(t:5, i:275465958), STATEMACHINELOGENTRY, 42288156@client-AE31E320960E:
org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: CANCELLED: client
cancelled {code}
Related gRPC-java issues:
[https://github.com/grpc/grpc-java/issues/9131]
[https://github.com/grpc/grpc-java/issues/8856] (This suggests that the
maxInboundMessageSize might be too small, but might be unlikely since Ratis
default configuration is 64MB)
Under normal condition, the index lag should trigger notifyInstallSnapshot
which will automatically trigger pipeline closure. However, due to the above
issue, the notifyInstallSnapshot is not called and therefore it requires manual
pipeline closure. Need to find the reasons of the CANCELLED gRPC logs. Any idea
is appreciated.
> Notify follower slowness based on the log index
> -----------------------------------------------
>
> Key: RATIS-2156
> URL: https://issues.apache.org/jira/browse/RATIS-2156
> Project: Ratis
> Issue Type: Improvement
> Components: Leader
> Reporter: Ivan Andika
> Assignee: Ivan Andika
> Priority: Major
> Attachments: image-2024-09-13-18-54-04-203.png
>
>
> Currently the StateMachine.LeaderEventApi#notifyFollowerSlowness is based on
> raft.server.rpc.slowness.timeout, we saw that sometimes there are some cases
> where the rpc rtt between the leader and follower does not exceed the
> timeout, the difference of the log index between the leader and follower
> keeps increasing, i.e. the slow follower cannot catch up.
> In Ozone, this causes most watch requests with ALL_COMMITTED replication to
> timeout, causing increased latency of writes. It is better to close the
> pipeline if the slow follower cannot catch up.
> !image-2024-09-13-18-54-04-203.png|width=1408,height=244!
--
This message was sent by Atlassian Jira
(v8.20.10#820010)