[ 
https://issues.apache.org/jira/browse/RATIS-1782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17689386#comment-17689386
 ] 

Tsz-wo Sze commented on RATIS-1782:
-----------------------------------

[~William Song], It seems that multiple requests hit DEADLINE_EXCEEDED in 
the old snapshot installation, and each of them triggered resetClient(..).  
Could you see whether checking isDone(), as below, would solve the problem?
{code}
@@ -557,6 +562,9 @@ public class GrpcLogAppender extends LogAppenderBase {
         LOG.info("{} is stopped", GrpcLogAppender.this);
         return;
       }
+      if (isDone()) {
+        return;
+      }
       GrpcUtil.warn(LOG, () -> this + ": Failed InstallSnapshot", t);
       grpcServerMetrics.onRequestRetry(); // Update try counter
       resetClient(null, true);
{code}
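
To show the intent of the guard outside Ratis, here is a minimal, self-contained sketch. The names StaleErrorSketch and Handler are hypothetical stand-ins, not the real GrpcLogAppender classes, and the sketch assumes the handler marks itself done after its first failure. With the isDone() check, several late errors from the same abandoned installation should cause only one reset.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for the log appender; NOT the real GrpcLogAppender.
class StaleErrorSketch {
  private final AtomicInteger resetCount = new AtomicInteger();

  // Stand-in for resetClient(..); here it only counts how often it is invoked.
  void resetClient() {
    resetCount.incrementAndGet();
  }

  int getResetCount() {
    return resetCount.get();
  }

  // Hypothetical per-installation response handler (inner class, so it can call resetClient()).
  class Handler {
    private final AtomicBoolean done = new AtomicBoolean();

    boolean isDone() {
      return done.get();
    }

    void onError(Throwable t) {
      if (isDone()) {
        return; // late error from an installation that was already abandoned
      }
      resetClient();
      done.set(true); // assumption: the handler is marked done after its first failure
    }
  }

  public static void main(String[] args) {
    StaleErrorSketch appender = new StaleErrorSketch();
    Handler old = appender.new Handler();
    // Three pending requests of the same old installation all time out.
    for (int i = 0; i < 3; i++) {
      old.onError(new RuntimeException("DEADLINE_EXCEEDED"));
    }
    System.out.println("resets = " + appender.getResetCount()); // prints "resets = 1"
  }
}
```

A handler created for a later installation starts with its own done flag, so its first error would still trigger a reset as usual.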


> gRPC installSnapshot timeout handler malfunctioning 
> ----------------------------------------------------
>
>                 Key: RATIS-1782
>                 URL: https://issues.apache.org/jira/browse/RATIS-1782
>             Project: Ratis
>          Issue Type: Bug
>          Components: gRPC, snapshot
>    Affects Versions: 2.4.1
>            Reporter: Song Ziyang
>            Priority: Blocker
>
> When the gRPC logAppender fails to install a snapshot on a follower owing 
> to a timeout, the onError callback is invoked and resetClient is called. 
> However, in this resetClient[1] handler, installSnapshotResponseHandler is 
> not set to null (unlike appendLogResponseHandler). As a result, pending 
> RPCs in the old installSnapshot pipe will time out and call onError again 
> sometime in the future, disrupting later in-progress installSnapshot requests.
> [1] 
> https://github.com/apache/ratis/blob/18eacaed31e4965a9c400d86409a88fea21fc18a/ratis-grpc/src/main/java/org/apache/ratis/grpc/server/GrpcLogAppender.java#L117-L120



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
