[
https://issues.apache.org/jira/browse/RATIS-1782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17689386#comment-17689386
]
Tsz-wo Sze commented on RATIS-1782:
-----------------------------------
[~William Song], it seems that multiple requests failed with
DEADLINE_EXCEEDED in the old snapshot installation, and each of them triggered
resetClient(..). Could you check whether adding the isDone() check below
solves the problem?
{code}
@@ -557,6 +562,9 @@ public class GrpcLogAppender extends LogAppenderBase {
           LOG.info("{} is stopped", GrpcLogAppender.this);
           return;
         }
+        if (isDone()) {
+          return;
+        }
         GrpcUtil.warn(LOG, () -> this + ": Failed InstallSnapshot", t);
         grpcServerMetrics.onRequestRetry(); // Update try counter
         resetClient(null, true);
{code}
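To illustrate the intent of the check, here is a minimal, self-contained sketch. It is not the Ratis code: SketchAppender, SnapshotHandler, newHandler() and getResetCount() are hypothetical names, resetClient() only counts calls, and the sketch assumes that close() marks the handler as done. It shows that, with the isDone() guard, repeated DEADLINE_EXCEEDED errors from one installation trigger a single reset.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for the log appender; NOT the Ratis GrpcLogAppender. */
class SketchAppender {
  private final AtomicInteger resetCount = new AtomicInteger();
  private volatile boolean running = true;

  /** Stand-in for resetClient(..): it only counts how often it is called. */
  void resetClient() {
    resetCount.incrementAndGet();
  }

  int getResetCount() {
    return resetCount.get();
  }

  boolean isRunning() {
    return running;
  }

  /** Creates a handler for one (hypothetical) snapshot installation. */
  SnapshotHandler newHandler() {
    return new SnapshotHandler();
  }

  /** Hypothetical response handler for a single snapshot installation. */
  class SnapshotHandler {
    private final AtomicBoolean done = new AtomicBoolean();

    boolean isDone() {
      return done.get();
    }

    /** Assumption of this sketch: closing the handler marks it as done. */
    void close() {
      done.set(true);
    }

    void onError(Throwable t) {
      if (!isRunning()) {
        return;
      }
      if (isDone()) {
        // Late error from an installation that has already been handled:
        // ignore it so that it cannot reset the client again.
        return;
      }
      resetClient();
      close();
    }
  }
}

class StaleHandlerDemo {
  public static void main(String[] args) {
    SketchAppender appender = new SketchAppender();
    SketchAppender.SnapshotHandler old = appender.newHandler();

    // Several pending requests of the same installation time out one by one.
    for (int i = 0; i < 3; i++) {
      old.onError(new RuntimeException("DEADLINE_EXCEEDED"));
    }
    System.out.println("resets after old installation: " + appender.getResetCount());

    // A new installation starts; another late error from the old one arrives.
    SketchAppender.SnapshotHandler current = appender.newHandler();
    old.onError(new RuntimeException("DEADLINE_EXCEEDED"));
    System.out.println("resets after late error: " + appender.getResetCount());
    System.out.println("current handler done: " + current.isDone());
  }
}
```

The sketch is single-threaded. If onError can be invoked concurrently, a compareAndSet on the done flag would be a safer guard than a separate check followed by close().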
> gRPC installSnapshot timeout handler malfunctioning
> ----------------------------------------------------
>
> Key: RATIS-1782
> URL: https://issues.apache.org/jira/browse/RATIS-1782
> Project: Ratis
> Issue Type: Bug
> Components: gRPC, snapshot
> Affects Versions: 2.4.1
> Reporter: Song Ziyang
> Priority: Blocker
>
> When the gRPC LogAppender fails to install a snapshot on a follower because
> of a timeout, the onError callback is invoked and resetClient is called.
> However, the resetClient[1] handler does not set
> installSnapshotResponseHandler to null (unlike appendLogReponseHandler, which
> it does reset). As a result, pending RPCs in the old installSnapshot pipe
> later time out and call onError again, disrupting ongoing installSnapshot
> requests.
> [1]
> https://github.com/apache/ratis/blob/18eacaed31e4965a9c400d86409a88fea21fc18a/ratis-grpc/src/main/java/org/apache/ratis/grpc/server/GrpcLogAppender.java#L117-L120