[ 
https://issues.apache.org/jira/browse/RATIS-1782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17688796#comment-17688796
 ] 

Song Ziyang commented on RATIS-1782:
------------------------------------

We are observing constant installSnapshot failures in our cluster.

What we see is that an ongoing installSnapshot is suddenly cancelled and reset 
due to 'TimeoutException: DEADLINE exceeded'.

However, this timeout exception does not seem to be caused by the ongoing 
installSnapshot request itself: the first chunk of that request started 
transmitting only on the order of a hundred milliseconds earlier, which does 
not match the 3-second deadline reported in the exception.

So we suspect that previously failed installSnapshot requests are disrupting 
the ongoing request. We do observe many failed requests in the cluster a few 
seconds earlier.
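
To make the suspected interference concrete, here is a minimal Java sketch of the sequence. It is not Ratis code and it does not reproduce the real GrpcLogAppender logic. SnapshotSender, Handler, startInstallSnapshot, the guard flag and secondRequestSurvives are hypothetical names made up for illustration; only onError and resetClient echo names from the issue. The sketch assumes that a leftover RPC from an abandoned stream reports its timeout through the old handler, and that the handler resets the sender unconditionally.

```java
// Hypothetical model of the suspected race; none of this is Ratis code.
class StaleHandlerSketch {

  /** Stand-in for the leader-side sender of installSnapshot requests. */
  static final class SnapshotSender {
    private final boolean guard;   // true: ignore errors from stale handlers
    private Handler current;       // handler of the in-flight request, or null
    private int resets;            // how many times resetClient() ran

    SnapshotSender(boolean guard) {
      this.guard = guard;
    }

    /** One handler per installSnapshot stream; several RPCs may share it. */
    final class Handler {
      void onError(String reason) {
        if (guard && this != current) {
          return;                  // error belongs to an already abandoned stream
        }
        resetClient();
      }
    }

    Handler startInstallSnapshot() {
      current = new Handler();
      return current;
    }

    private void resetClient() {
      resets++;
      current = null;              // cancels whatever request is in flight
    }

    boolean inFlight() { return current != null; }
    int resetCount() { return resets; }
  }

  /** Replays the suspected sequence; returns true if the second request survives. */
  static boolean secondRequestSurvives(boolean guard) {
    SnapshotSender sender = new SnapshotSender(guard);
    SnapshotSender.Handler first = sender.startInstallSnapshot();
    first.onError("deadline exceeded");        // legitimate failure, sender resets
    sender.startInstallSnapshot();             // retry starts a fresh request
    first.onError("deadline exceeded (late)"); // leftover RPC of the old stream
    return sender.inFlight();
  }

  public static void main(String[] args) {
    System.out.println("without guard: " + secondRequestSurvives(false));
    System.out.println("with guard:    " + secondRequestSurvives(true));
  }
}
```

Without the guard, the late error from the first stream resets the sender while the second request is only just starting. That would match a reset arriving a few hundred milliseconds into a request whose deadline is 3 seconds. Whether an identity check like this is the right fix for Ratis is a separate question; the sketch only illustrates the failure mode we suspect.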

2023-02-15 09:34:14,246 [grpc-default-executor-47] WARN  o.a.ratis.util.LogUtils:124 - 9@group-000100000003->6-InstallSnapshotResponseHandler: Failed InstallSnapshot
org.apache.ratis.protocol.exceptions.TimeoutIOException: deadline exceeded after 2.999902604s. [remote_addr=172.16.2.5/172.16.2.5:10760]
        at org.apache.ratis.grpc.GrpcUtil.tryUnwrapException(GrpcUtil.java:105)
        at org.apache.ratis.grpc.GrpcUtil.unwrapThrowable(GrpcUtil.java:89)
        at org.apache.ratis.grpc.GrpcUtil.warn(GrpcUtil.java:185)
        at org.apache.ratis.grpc.server.GrpcLogAppender$InstallSnapshotResponseHandler.onError(GrpcLogAppender.java:564)
        at org.apache.ratis.thirdparty.io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onClose(ClientCalls.java:487)
        at org.apache.ratis.thirdparty.io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:562)
        at org.apache.ratis.thirdparty.io.grpc.internal.ClientCallImpl.access$300(ClientCallImpl.java:70)
        at org.apache.ratis.thirdparty.io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInternal(ClientCallImpl.java:743)
        at org.apache.ratis.thirdparty.io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:722)
        at org.apache.ratis.thirdparty.io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
        at org.apache.ratis.thirdparty.io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: DEADLINE_EXCEEDED: deadline exceeded after 2.999902604s. [remote_addr=172.16.2.5/172.16.2.5:10760]
        at org.apache.ratis.thirdparty.io.grpc.Status.asRuntimeException(Status.java:535)
        ... 10 common frames omitted

> gRPC installSnapshot timeout handler malfunctioning 
> ----------------------------------------------------
>
>                 Key: RATIS-1782
>                 URL: https://issues.apache.org/jira/browse/RATIS-1782
>             Project: Ratis
>          Issue Type: Bug
>          Components: gRPC, snapshot
>    Affects Versions: 2.4.1
>            Reporter: Song Ziyang
>            Priority: Blocker
>
> When the gRPC logAppender fails to install a snapshot on a follower owing to 
> a timeout, the onError callback is invoked and resetClient is called. 
> However, in this resetClient[1] handler, installSnapshotResponseHandler is 
> not set to null (unlike appendLogReponseHandler). As a result, pending 
> RPCs in the old installSnapshot pipe will time out and call onError again 
> at some point in the future, disrupting later ongoing installSnapshot requests.
> [1] 
> https://github.com/apache/ratis/blob/18eacaed31e4965a9c400d86409a88fea21fc18a/ratis-grpc/src/main/java/org/apache/ratis/grpc/server/GrpcLogAppender.java#L117-L120



