SzyWilliam commented on PR #876:
URL: https://github.com/apache/ratis/pull/876#issuecomment-1513478739
Thanks very much for the patch! I applied it and ran tests, and here is
what I found.
The leader successfully sent 175 installSnapshot RPCs to the follower
(thanks to the streaming timeout).
```java
2023-04-18 11:29:51,004 [grpc-default-executor-15] INFO
o.a.r.g.s.GrpcLogAppender$InstallSnapshotResponseHandler:530 -
7@group-000100000002->9-InstallSnapshotResponseHandler: Completed
InstallSnapshot. Reply: serverReply {
requestorId: "7"
replyId: "9"
raftGroupId {
id: "GGGGGGGGGG\000\001\000\000\000\002"
}
success: true
}
term: 1
requestIndex: 175
2023-04-18 11:29:51,004 [grpc-default-executor-15] INFO
o.a.r.s.i.FollowerInfoImpl:126 - Follower 7@group-000100000002->9 acknowledged
installing snapshot
```
However, after the leader sent the 175th RPC, it made no further progress.
Instead, it stalled for about 4.4s, after which the installSnapshot
streaming connection was cancelled and closed by a RST_STREAM.
```java
2023-04-18 11:29:55,480 [grpc-default-executor-15] WARN
o.a.ratis.util.LogUtils:122 -
7@group-000100000002->9-InstallSnapshotResponseHandler: Failed InstallSnapshot:
org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: CANCELLED:
RST_STREAM closed stream. HTTP/2 error code: CANCEL
```
On the follower side, it replied to the 175th installSnapshot RPC normally, and
about 4.4s later it discovered that the stream had been cancelled by the leader.
```java
2023-04-18 11:29:50,946 [grpc-default-executor-1] INFO
o.a.r.s.i.SnapshotInstallationHandler:100 - 9@group-000100000002: reply
installSnapshot: 7<-9#0:OK-t1,SUCCESS,requestIndex=138
2023-04-18 11:29:55,423 [grpc-default-executor-0] WARN
o.a.ratis.util.LogUtils:122 - 9: INSTALL_SNAPSHOT onError, lastRequest:
7->9#0-t1,chunk:04d6f0e0-41d8-4a40-b65f-f195bce7a405,175:
org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: CANCELLED: client
cancelled
```
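As a quick sanity check on the 4.4s figure, here is a minimal sketch that subtracts the timestamps quoted in the logs above. The class name `StallGap` and the helper `gapMillis` are hypothetical and not part of Ratis; the timestamp pattern is assumed from the log lines as pasted.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper, not part of Ratis: measures the gap between two log timestamps.
public class StallGap {
    // Assumed timestamp layout, taken from the log lines quoted in this comment.
    private static final DateTimeFormatter FMT =
        DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss,SSS");

    // Returns the number of milliseconds elapsed between the two timestamps.
    static long gapMillis(String from, String to) {
        return Duration.between(
            LocalDateTime.parse(from, FMT),
            LocalDateTime.parse(to, FMT)).toMillis();
    }

    public static void main(String[] args) {
        // Leader: "Completed InstallSnapshot" -> "Failed InstallSnapshot"
        System.out.println("leader gap (ms): "
            + gapMillis("2023-04-18 11:29:51,004", "2023-04-18 11:29:55,480"));
        // Follower: "reply installSnapshot" -> "INSTALL_SNAPSHOT onError"
        System.out.println("follower gap (ms): "
            + gapMillis("2023-04-18 11:29:50,946", "2023-04-18 11:29:55,423"));
    }
}
```

Both sides show nearly the same gap, which is consistent with a single stall followed by one cancellation that each side observes, rather than two independent delays.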
No GC or other abnormalities were detected in the meantime.
This situation repeated 12 times, each time stuck at installSnapshot RPC index
**175**. Therefore, I guess the 175th is the last chunk of this snapshot, and I
suspect that there is a deadlock in streaming installSnapshot
**completion**.
Now it seems that `ServerRequestStreamObserver` is not to blame for this
deadlock. Is there anything that I missed?