[ https://issues.apache.org/jira/browse/TINKERPOP-2569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17420651#comment-17420651 ]
ASF GitHub Bot commented on TINKERPOP-2569:
-------------------------------------------

divijvaidya commented on pull request #1478:
URL: https://github.com/apache/tinkerpop/pull/1478#issuecomment-927738373

Let me try to phrase the situation as I understand it; tell me if I missed anything.

As per the current code base, the definition of client initialization is that a successful connection must have been established with at least one host in the cluster. Currently, when the client tries to send its first request, `init()` is called. `init()` tries to initialize connections across all hosts registered in the cluster, and if one of the hosts is down, the client logs an error. Note that if initialization fails, the client does not throw an error or provide any response indicating that initialization has failed. This is intended behaviour because, per the definition of "successful initialization", a client is successful if at least one of the hosts has connected successfully.

If no host is successful, client initialization **does not fail with a `NoHostAvailableException`**. This is because we only throw this exception if the host has been marked as "available" in the cluster, and it is not necessary for a host to have active connections to be marked "available". Hence, even though connection initialization to the host has failed, the host is still marked as "available". This leads to `Client.initialized` being set to true, which is incorrect given the definition of client initialization.

The code change in this PR sets a flag, `noLiveHostAvailable`, if at least one of the hosts fails during client initialization. On the next request, the client then tries `init()` again for all hosts registered in the cluster. Let's consider how this change performs in different cases.

Case #1: one host registered to the cluster, and the host is down during client init. `init()` will not throw a `NoHostAvailableException`. `Client.initialized` will be set to true and `Client.noLiveHostAvailable` will be true. When a second request is made to the client, it will try `init()` again, and if the host is alive at that time, it will succeed. Note that the new flag is doing here what the original `Client.initialized` flag should have been doing.

Case #2: multiple hosts registered, and only one host is down during client init. `init()` will not throw a `NoHostAvailableException`. `Client.initialized` will be set to true and `Client.noLiveHostAvailable` will be true. When a second request is made to the client, it will try `init()` again against all the hosts. This leads to the already-successful hosts also creating a new connection pool, which is wrong and wasteful.

Due to the problem in case #2, this PR is not a suitable change. I would propose the following alternative (a sketch follows the list). A client should:

1. Mark the host as available in the cluster only after connections to the host have been initialized.
2. During client initialization, mark hosts with successful connections as available and hosts with unsuccessful connections as unavailable.
3. If no hosts are successful during client init, throw a `NoHostAvailableException` to the user and let the user handle this scenario. Different applications want different behaviour: some might want to fail their application startup while others might want to keep retrying until successful. We should provide applications the flexibility to decide this for themselves.
4. If some hosts are successful and some aren't, mark the unavailable ones via `considerHostUnavailable()`. This keeps retrying the failed hosts in the background while the successful hosts serve existing connections.

I would suggest making the changes to follow the above points; that will also make the original test pass, since `init()` will then be called on every request in the single-host case (`client.initialized` remaining false).
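To make the proposal concrete, here is a minimal sketch of what `init()` could look like under points 1-4. All type and method names (`Host`, `ConnectionPool`, `markAvailable()`, and so on) are illustrative stand-ins, not the actual `org.apache.tinkerpop.gremlin.driver` internals:

{code}
import java.util.ArrayList;
import java.util.List;

class NoHostAvailableException extends RuntimeException { }

// Illustrative stand-in: the real Host tracks availability for the
// load balancer and the reconnect scheduler.
class Host {
    private volatile boolean available;
    void markAvailable()   { available = true; }
    void markUnavailable() { available = false; }
    void considerHostUnavailable() {
        markUnavailable();
        // in the real driver: schedule periodic reconnect attempts
    }
}

// Illustrative stand-in: the real constructor opens the initial
// connections and throws if the host is unreachable.
class ConnectionPool {
    ConnectionPool(final Host host) { /* may throw on a dead host */ }
}

class Client {
    private final List<Host> hosts;
    private volatile boolean initialized;

    Client(final List<Host> hosts) { this.hosts = hosts; }

    void init() {
        final List<Host> connected = new ArrayList<>();
        for (final Host host : hosts) {
            try {
                new ConnectionPool(host);       // connections first...
                host.markAvailable();           // ...then available (points 1 and 2)
                connected.add(host);
            } catch (final Exception ex) {
                host.considerHostUnavailable(); // background retries (points 2 and 4)
            }
        }
        if (connected.isEmpty()) {
            // point 3: surface the failure and let the application decide
            throw new NoHostAvailableException();
        }
        initialized = true; // true only with at least one live host
    }
}
{code}

With this shape, the single-host test in the issue below passes naturally: the first `submit()` surfaces a `NoHostAvailableException`, `initialized` stays false, and a later `submit()` re-runs `init()` once the server is back.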
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@tinkerpop.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org

> Reconnect to server if Java driver fails to initialize
> ------------------------------------------------------
>
>                 Key: TINKERPOP-2569
>                 URL: https://issues.apache.org/jira/browse/TINKERPOP-2569
>             Project: TinkerPop
>          Issue Type: Bug
>          Components: driver
>    Affects Versions: 3.4.11
>            Reporter: Stephen Mallette
>            Priority: Minor
>
> As reported here on SO: https://stackoverflow.com/questions/67586427/how-to-recover-with-a-retry-from-gremlin-nohostavailableexception
> If the host is unavailable at {{Client}} initialization then the host is not put in a state where reconnect is possible. Essentially, this test for {{GremlinServerIntegrateTest}} should pass:
> {code}
> @Test
> public void shouldFailOnInitiallyDeadHost() throws Exception {
>     // start test with no server
>     this.stopServer();
>
>     final Cluster cluster = TestClientFactory.build().create();
>     final Client client = cluster.connect();
>
>     try {
>         // try to re-issue a request now that the server is down
>         client.submit("g").all().get(3000, TimeUnit.MILLISECONDS);
>         fail("Should throw an exception.");
>     } catch (RuntimeException re) {
>         // Client would have no active connections to the host, hence it would
>         // encounter a timeout trying to find an alive connection to the host.
>         assertThat(re.getCause(), instanceOf(NoHostAvailableException.class));
>
>         //
>         // should recover when the server comes back
>         //
>
>         // restart server
>         this.startServer();
>
>         // try a bunch of times to reconnect. on slower systems this may simply
>         // take longer...looking at you travis
>         for (int ix = 1; ix < 11; ix++) {
>             // the retry interval is 1 second, wait a bit longer
>             TimeUnit.SECONDS.sleep(5);
>             try {
>                 final List<Result> results = client.submit("1+1").all().get(3000, TimeUnit.MILLISECONDS);
>                 assertEquals(1, results.size());
>                 assertEquals(2, results.get(0).getInt());
>             } catch (Exception ex) {
>                 if (ix == 10)
>                     fail("Should have eventually succeeded");
>             }
>         }
>     } finally {
>         cluster.close();
>     }
> }
> {code}
> Note that there is a similar test, {{shouldFailOnDeadHost()}}, which first connects to a host, then kills it and restarts it again, demonstrating that reconnection works in that situation.
> I thought it might be an easy fix to simply call {{considerHostUnavailable()}} in the {{ConnectionPool}} constructor in the event of a {{CompletionException}}, which should kickstart the reconnect process. The reconnects started firing but they all failed for some reason. I didn't have time to investigate further than that.
> Currently the only workaround is to recreate the {{Client}} if this sort of situation occurs.
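For illustration, the "easy fix" described in the quoted issue would sit roughly here. This is a sketch against simplified stand-in types, not the actual {{ConnectionPool}} source; only the placement of the hook mirrors the issue text:

{code}
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;

// Simplified stand-in for the driver's ConnectionPool, showing only
// where the considerHostUnavailable() call would go.
class ConnectionPool {
    private final Host host;

    ConnectionPool(final Host host) {
        this.host = host;
        try {
            // the real constructor eagerly opens its initial connections and
            // joins their futures; join() is what surfaces CompletionException
            // when the server is down
            CompletableFuture.allOf(openInitialConnections()).join();
        } catch (CompletionException ce) {
            // instead of leaving the host half-initialized, hand it to the
            // reconnect machinery so periodic retries start immediately
            considerHostUnavailable();
        }
    }

    private CompletableFuture<?>[] openInitialConnections() {
        // placeholder for the real connection-opening logic
        return new CompletableFuture<?>[] { CompletableFuture.completedFuture(null) };
    }

    private void considerHostUnavailable() {
        // in the real driver this marks the host dead and kicks off the
        // scheduled reconnect attempts
    }
}

class Host { }
{code}

As the issue notes, the reconnects this triggers were observed to fire but fail, so the sketch only marks where the hook belongs, not why those retries broke.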
-- This message was sent by Atlassian Jira (v8.3.4#803005)