[ 
https://issues.apache.org/jira/browse/RATIS-592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16870385#comment-16870385
 ] 

Tsz Wo Nicholas Sze commented on RATIS-592:
-------------------------------------------

> We would always reconnect in case of AlreadyClosedException. This can lead to 
> a reconnection multiple times ...

[~ljain], let's do this optimization separately.  When there are multiple 
AlreadyClosedExceptions, the first one should reconnect but the following are 
not.  If we do not reconnect at all, it will never reconnect, i.e. the bug 
described here.


[~swagle], in RATIS-571, we forgot to call  IOUtils.shouldReconnect(t) in 
GrpcClientRpc.shouldReconnect(..).  Instead of adding AlreadyClosedException 
check in GrpcClientRpc, how about we have the following?
{code}
@@ -61,6 +62,6 @@ public interface RaftClientRpc extends Closeable {
    * @return true if the given throwable should be handled; otherwise, return 
false.
    */
   default boolean shouldReconnect(Throwable t) {
-    return false;
+    return IOUtils.shouldReconnect(t);
   }
 }
{code}
{code}
+++ b/ratis-grpc/src/main/java/org/apache/ratis/grpc/client/GrpcClientRpc.java
@@ -170,6 +170,6 @@ public class GrpcClientRpc extends 
RaftClientRpcWithProxy<GrpcClientProtocolClie
     } else if (e instanceof IllegalArgumentException) {
       return e.getMessage().contains("null frame before EOS");
     }
-    return false;
+    return super.shouldReconnect(e);
   }
 }
{code}

> One node ratis writes fail forever after first NotLeaderException or 
> LeaderNotReadyException
> --------------------------------------------------------------------------------------------
>
>                 Key: RATIS-592
>                 URL: https://issues.apache.org/jira/browse/RATIS-592
>             Project: Ratis
>          Issue Type: Bug
>          Components: gRPC
>    Affects Versions: 0.3.0
>            Reporter: Siddharth Wagle
>            Assignee: Siddharth Wagle
>            Priority: Critical
>             Fix For: 0.4.0
>
>         Attachments: RATIS-592.01.patch, RATIS-592.02.patch, 
> RATIS-592.03.patch
>
>
> RATIS-571, modified the GrpcClientProtocolClient to not set the 
> AsyncStreamObserver reference to null on an exception, however, the ReplyMap 
> reference is set to null. This results in the client getting an 
> AlredyClosedException on the stream on a retry for a NotLeader or a 
> LeadrNotReady exception and never recovers. This is common in a unit test 
> scenario where a request is sent immediately after the cluster is up.
> There is nothing special here about one node Ratis however, the HDDS unit 
> tests that fail are all one node Ratis and the most probable cause is that 
> with client retrying a different node each time, increases the chance of 
> success on a three-node ring.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to