SzyWilliam commented on PR #876:
URL: https://github.com/apache/ratis/pull/876#issuecomment-1513478739

   Thanks very much for the patch! I applied it, ran some tests, and here is 
what I found.
   The leader successfully sent 175 installSnapshot RPCs to the follower 
(thanks to the streaming timeout). 
   ```java
   2023-04-18 11:29:51,004 [grpc-default-executor-15] INFO  
o.a.r.g.s.GrpcLogAppender$InstallSnapshotResponseHandler:530 - 
7@group-000100000002->9-InstallSnapshotResponseHandler: Completed 
InstallSnapshot. Reply: serverReply {
     requestorId: "7"
     replyId: "9"
     raftGroupId {
       id: "GGGGGGGGGG\000\001\000\000\000\002"
     }
     success: true
   }
   term: 1
   requestIndex: 175
    
   2023-04-18 11:29:51,004 [grpc-default-executor-15] INFO  
o.a.r.s.i.FollowerInfoImpl:126 - Follower 7@group-000100000002->9 acknowledged 
installing snapshot 
   ```
   However, after the 175th RPC, the leader did not proceed any further. 
Instead, it stalled for 4.4s, after which the installSnapshot streaming 
connection was cancelled and closed by a RST_STREAM.
   ```java
   2023-04-18 11:29:55,480 [grpc-default-executor-15] WARN  
o.a.ratis.util.LogUtils:122 - 
7@group-000100000002->9-InstallSnapshotResponseHandler: Failed InstallSnapshot: 
org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: CANCELLED: 
RST_STREAM closed stream. HTTP/2 error code: CANCEL
   ```
   
   On the follower side, it replied to the 175th installSnapshot RPC normally 
and, 4.4s later, discovered that the stream had been cancelled by the leader.
   ```java
   2023-04-18 11:29:50,946 [grpc-default-executor-1] INFO  
o.a.r.s.i.SnapshotInstallationHandler:100 - 9@group-000100000002: reply 
installSnapshot: 7<-9#0:OK-t1,SUCCESS,requestIndex=138 
   2023-04-18 11:29:55,423 [grpc-default-executor-0] WARN  
o.a.ratis.util.LogUtils:122 - 9: INSTALL_SNAPSHOT onError, lastRequest: 
7->9#0-t1,chunk:04d6f0e0-41d8-4a40-b65f-f195bce7a405,175: 
org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: CANCELLED: client 
cancelled 
   ```
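
   For reference, the stall on each side can be read directly off the log 
timestamps quoted above. The snippet below is only a small sketch for that 
arithmetic; the class and method names (`StallGap`, `gapMillis`) are made up 
for illustration and are not part of Ratis.

   ```java
   import java.time.Duration;
   import java.time.LocalTime;
   import java.time.format.DateTimeFormatter;

   // Hypothetical helper, not part of Ratis: measures the gap between two log timestamps.
   public class StallGap {
       // Matches the time-of-day part of the log lines above, e.g. "11:29:51,004".
       private static final DateTimeFormatter LOG_TIME =
           DateTimeFormatter.ofPattern("HH:mm:ss,SSS");

       // Returns the elapsed milliseconds between two timestamps on the same day.
       static long gapMillis(String from, String to) {
           LocalTime start = LocalTime.parse(from, LOG_TIME);
           LocalTime end = LocalTime.parse(to, LOG_TIME);
           return Duration.between(start, end).toMillis();
       }

       public static void main(String[] args) {
           // Leader: last "Completed InstallSnapshot" -> "Failed InstallSnapshot".
           System.out.println("leader gap (ms): "
               + gapMillis("11:29:51,004", "11:29:55,480"));
           // Follower: last "reply installSnapshot" -> "INSTALL_SNAPSHOT onError".
           System.out.println("follower gap (ms): "
               + gapMillis("11:29:50,946", "11:29:55,423"));
       }
   }
   ```

   Both gaps come out within a millisecond of each other, which is consistent 
with the two sides observing the same stall rather than two independent delays.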
   
   No GC pauses or other abnormalities were detected during this period.
   
   This situation repeated 12 times, each time stuck at installSnapshot RPC 
index **175**. Therefore, I guess the 175th is the last chunk of this snapshot 
and suspect that there is a deadlock in streaming installSnapshot 
**completion**.
   It now seems that `ServerRequestStreamObserver` is not to blame for this 
deadlock. Is there anything that I missed?
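
   One way to narrow this down might be to check whether the JVM itself 
reports a lock-level deadlock during the 4.4s stall. The sketch below uses the 
standard `ThreadMXBean` API; the class name `DeadlockProbe` is hypothetical and 
this is not Ratis code. Note the assumption: it only detects threads blocked 
in a cycle on monitors or ownable synchronizers. If the stall comes from, say, 
a future that is never completed, it will report nothing, and a thread dump 
would be needed instead.

   ```java
   import java.lang.management.ManagementFactory;
   import java.lang.management.ThreadInfo;
   import java.lang.management.ThreadMXBean;
   import java.util.ArrayList;
   import java.util.List;

   // Hypothetical diagnostic helper, not part of Ratis.
   public class DeadlockProbe {
       // Returns one description per thread that the JVM considers deadlocked,
       // or an empty list when no lock cycle is found.
       static List<String> deadlockedThreads() {
           ThreadMXBean mx = ManagementFactory.getThreadMXBean();
           long[] ids = mx.findDeadlockedThreads(); // null when there is no deadlock
           List<String> result = new ArrayList<>();
           if (ids == null) {
               return result;
           }
           for (ThreadInfo info : mx.getThreadInfo(ids)) {
               if (info != null) { // a thread may have exited in the meantime
                   result.add(info.getThreadName() + " waiting on " + info.getLockName());
               }
           }
           return result;
       }

       public static void main(String[] args) {
           List<String> found = deadlockedThreads();
           if (found.isEmpty()) {
               System.out.println("no lock-level deadlock reported");
           } else {
               found.forEach(System.out::println);
           }
       }
   }
   ```

   An empty result here would point away from a classic lock cycle and toward 
a missing completion signal on the last chunk.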


