[ https://issues.apache.org/jira/browse/TINKERPOP-2569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17420651#comment-17420651 ]

ASF GitHub Bot commented on TINKERPOP-2569:
-------------------------------------------

divijvaidya commented on pull request #1478:
URL: https://github.com/apache/tinkerpop/pull/1478#issuecomment-927738373


   Let me try to describe the situation as I understand it; tell me if I missed anything.
   
   As per the current code base, the definition of successful client initialization is that connections must have been established with at least one host in the cluster.
   
   Currently, when the client tries to send its first request, `init()` is called. `init()` tries to initialize connections across all hosts registered in the cluster, and if one of the hosts is down, the client will log an error. Note that if the initialization fails, the client will not throw an error or provide any response indicating that initialization has failed. This is intended behaviour because, as per the definition of "successful initialization", a client is successful if at least one of the hosts has connected successfully.
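
   To make the current behaviour concrete from the application's point of view, here is a rough sketch. The endpoint is hypothetical and the server at that address is assumed to be down when the application starts; only the public driver API is used.

   ```java
   import java.util.concurrent.TimeUnit;

   import org.apache.tinkerpop.gremlin.driver.Client;
   import org.apache.tinkerpop.gremlin.driver.Cluster;

   public class LazyInitSketch {
       public static void main(String[] args) throws Exception {
           // Hypothetical endpoint; assume the server at this address is down.
           final Cluster cluster = Cluster.build("localhost").port(8182).create();
           final Client client = cluster.connect();  // returns immediately, nothing is connected yet

           try {
               // init() runs lazily on this first request; the unreachable host is only
               // logged as an error and the client still considers initialization done.
               client.submit("1+1").all().get(3000, TimeUnit.MILLISECONDS);
           } catch (Exception ex) {
               // The request fails (e.g. while looking for a live connection), but there
               // is no explicit signal that client initialization itself failed.
               System.err.println("request failed: " + ex);
           } finally {
               cluster.close();
           }
       }
   }
   ```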
   
   If no host is successful, the client initialization **would not fail with a `NoHostAvailableException`**. This is because we only throw this exception when no host has been marked as "available" in the cluster, and it is not necessary for a host to have active connections to be marked as "available". Hence, even though connection initialization to the host has failed, the host is still marked as "available". This leads to `Client.initialized` being set to true, which is incorrect based on the definition of client initialization above.
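
   The sequence above can be modelled with a small self-contained sketch. This is not the actual driver code, only an illustration (with made-up class and method names) of why the "no available host" check never trips:

   ```java
   import java.util.List;

   public class AvailabilitySketch {

       // Simplified stand-in for the driver's notion of a host.
       static class Host {
           final String address;
           boolean available;
           Host(final String address) { this.address = address; }
       }

       // Stand-in for creating a connection pool against a host that is down.
       static void openConnectionPool(final Host host) {
           throw new RuntimeException("Connection refused: " + host.address);
       }

       public static void main(String[] args) {
           // Hypothetical single-host cluster whose only server is down.
           final List<Host> hosts = List.of(new Host("localhost:8182"));

           for (final Host host : hosts) {
               host.available = true;  // marked available before any connection exists
               try {
                   openConnectionPool(host);
               } catch (RuntimeException re) {
                   // the failure is only logged; the "available" mark is not reverted
                   System.err.println("Could not initialize connection pool for " + host.address);
               }
           }

           // The exception is only thrown when no host is marked available, so this
           // check passes and initialization is reported as successful anyway.
           if (hosts.stream().noneMatch(h -> h.available)) {
               throw new RuntimeException("NoHostAvailableException (modelled)");
           }
           System.out.println("initialized = true, live connections = 0");
       }
   }
   ```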
   
   The code change in this PR sets a flag `noLiveHostAvailable` if at least one of the hosts fails during client initialization.
   On the next request, it then tries to `init()` the client again for all the hosts registered in the cluster.
   
   Let's consider how this change will perform in different cases.
   
   Case#1: One host registered to the cluster and the host is down during client init
   
   In such cases, init() will not throw a `NoHostAvailableException`. `Client.initialized` will be set to true and `Client.noLiveHostAvailable` will be true. When a second request is made to the client, it will try to init() again and, if the host is alive at that time, it will succeed.
   
   Note that the new flag is doing what the original `Client.initialized` flag 
should have been doing here.
   
   Case#2: Multiple hosts registered and only one host is down during client init
   
   In such cases, init() will not throw a `NoHostAvailableException`. `Client.initialized` will be set to true and `Client.noLiveHostAvailable` will be true. When a second request is made to the client, it will try to init() again against all the hosts. This will lead to the hosts that already succeeded creating new connection pools as well (which is wrong and wasteful).
   
   Due to the above problem in case#2, this PR is not a suitable change.
   
   I would propose the following alternative:
   
   A client should:
   1. Mark a host as available in the cluster only after connections to the host have been initialized.
   2. During client initialization, hosts with successful connections should be marked as available and hosts with unsuccessful connections should be marked as unavailable.
   3. If no hosts are successful during client init, a `NoHostAvailableException` should be thrown so that the user can handle this scenario. Different applications might want different behaviour: some might want to fail their application startup while others might want to keep retrying until successful, and we should give applications the flexibility to decide this for themselves (see the sketch after this list).
   4. If some hosts are successful and some aren't, we should mark the failed ones unavailable via `considerHostUnavailable()`. This would keep retrying the failed hosts in the background while the successful hosts keep serving existing connections.
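
   To illustrate point 3 from the application's side, here is a sketch of what a caller could then do. It assumes the proposed behaviour, i.e. that a request against a cluster with no reachable hosts surfaces a `NoHostAvailableException`; the endpoint, retry count and sleep interval are made up:

   ```java
   import java.util.concurrent.TimeUnit;

   import org.apache.tinkerpop.gremlin.driver.Client;
   import org.apache.tinkerpop.gremlin.driver.Cluster;
   import org.apache.tinkerpop.gremlin.driver.exception.NoHostAvailableException;

   public class StartupRetrySketch {
       public static void main(String[] args) throws Exception {
           // Hypothetical endpoint; the server may not be up yet when we start.
           final Cluster cluster = Cluster.build("localhost").port(8182).create();
           try {
               final Client client = cluster.connect();
               for (int attempt = 1; attempt <= 10; attempt++) {
                   try {
                       client.submit("1+1").all().get(3000, TimeUnit.MILLISECONDS);
                       break;                              // at least one host is live now
                   } catch (Exception ex) {
                       if (!(ex.getCause() instanceof NoHostAvailableException) || attempt == 10)
                           throw ex;                       // fail the application startup
                       TimeUnit.SECONDS.sleep(5);          // or keep retrying until successful
                   }
               }
           } finally {
               cluster.close();
           }
       }
   }
   ```

   An application that prefers to fail fast would simply not loop and let the first `NoHostAvailableException` propagate.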
   
   I would suggest making the changes to follow the above points; that will also make the original test pass (since, in the single-host case, `client.initialized` will remain false and `init()` will be called again on every request).
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@tinkerpop.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Reconnect to server if Java driver fails to initialize
> ------------------------------------------------------
>
>                 Key: TINKERPOP-2569
>                 URL: https://issues.apache.org/jira/browse/TINKERPOP-2569
>             Project: TinkerPop
>          Issue Type: Bug
>          Components: driver
>    Affects Versions: 3.4.11
>            Reporter: Stephen Mallette
>            Priority: Minor
>
> As reported here on SO: 
> https://stackoverflow.com/questions/67586427/how-to-recover-with-a-retry-from-gremlin-nohostavailableexception
> If the host is unavailable at {{Client}} initialization then the host is not 
> put in a state where reconnect is possible. Essentially, this test for 
> {{GremlinServerIntegrateTest}} should pass:
> {code}
> @Test
> public void shouldFailOnInitiallyDeadHost() throws Exception {
>     // start test with no server
>     this.stopServer();
>     final Cluster cluster = TestClientFactory.build().create();
>     final Client client = cluster.connect();
>     try {
>         // try to re-issue a request now that the server is down
>         client.submit("g").all().get(3000, TimeUnit.MILLISECONDS);
>         fail("Should throw an exception.");
>     } catch (RuntimeException re) {
>         // Client would have no active connections to the host, hence it would encounter a timeout
>         // trying to find an alive connection to the host.
>         assertThat(re.getCause(), instanceOf(NoHostAvailableException.class));
>         //
>         // should recover when the server comes back
>         //
>         // restart server
>         this.startServer();
>         // try a bunch of times to reconnect. on slower systems this may simply take longer...looking at you travis
>         for (int ix = 1; ix < 11; ix++) {
>             // the retry interval is 1 second, wait a bit longer
>             TimeUnit.SECONDS.sleep(5);
>             try {
>                 final List<Result> results = client.submit("1+1").all().get(3000, TimeUnit.MILLISECONDS);
>                 assertEquals(1, results.size());
>                 assertEquals(2, results.get(0).getInt());
>             } catch (Exception ex) {
>                 if (ix == 10)
>                     fail("Should have eventually succeeded");
>             }
>         }
>     } finally {
>         cluster.close();
>     }
> }
> {code}
> Note that there is a similar test, {{shouldFailOnDeadHost()}}, which first connects to a 
> host, then kills it, and then restarts it, demonstrating that reconnection works in that 
> situation.
> I thought it might be an easy fix to simply call {{considerHostUnavailable()}} in the 
> {{ConnectionPool}} constructor in the event of a {{CompletionException}}, which should 
> kickstart the reconnect process. The reconnects started firing but they all failed for 
> some reason. I didn't have time to investigate further than that. 
> Currently the only workaround is to recreate the {{Client}} if this sort of situation 
> occurs.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
