[ 
https://issues.apache.org/jira/browse/TINKERPOP-2569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17354548#comment-17354548
 ] 

ASF GitHub Bot commented on TINKERPOP-2569:
-------------------------------------------

divijvaidya commented on a change in pull request #1422:
URL: https://github.com/apache/tinkerpop/pull/1422#discussion_r642584008



##########
File path: gremlin-driver/src/main/java/org/apache/tinkerpop/gremlin/driver/ConnectionPool.java
##########
@@ -111,8 +111,11 @@ public ConnectionPool(final Host host, final Client client, final Optional<Integ
             logger.warn("Initialization of connections cancelled for {}", getPoolInfo(), ce);
             throw ce;
         } catch (CompletionException ce) {
-            // Some connections might have been initialized. Close the connection pool gracefully to close them.
-            this.closeAsync();
+            // since there was some error here, the host is likely unavailable at this point. if we dont mark it as
+            // such the client won't try to reconnect to the dead host. need to initialize this.open because if we
+            // don't then reconnects will fail when trying to create a new one with addConnectionIfUnderMaximum()
+            this.open = new AtomicInteger(0);
+            considerHostUnavailable();

Review comment:
       This was intentionally removed in 
https://github.com/apache/tinkerpop/pull/1360 
   
   Here's what happens if we don't remove this:
   
   The connection has failed here. It could be due to a misconfiguration such as
wrong SSL credentials. `considerHostUnavailable()` will create a background task
that tries to send a reconnect message to the host. To send that message, it will
try to borrow a connection from the connection pool, but the connection pool had a
problem during initialization in the first place, so it will try to initialize
connections again and fail again because of the SSL misconfiguration. On failure,
it will call `considerHostUnavailable()` again. After some wait, `borrowConnection`
will try to create connections again, which will again fail, and eventually, after
a timeout, everything gives up and a timeout is thrown back to the application. On
the application side, they will observe a timed-out exception and have no clue why
the connection failed, because the actual reason for the failure is hidden in logs
from 15 minutes ago.
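
   To make that failure mode concrete, here is a rough application-side sketch (the
endpoint, SSL settings, query and class name are illustrative assumptions, not taken
from this PR): the submit eventually times out, and nothing in the surfaced exception
points back at the SSL problem that actually broke pool initialization.
   
   ```java
   import java.util.concurrent.ExecutionException;
   import java.util.concurrent.TimeUnit;
   import java.util.concurrent.TimeoutException;
   
   import org.apache.tinkerpop.gremlin.driver.Client;
   import org.apache.tinkerpop.gremlin.driver.Cluster;
   
   public class HiddenSslFailureExample {
       public static void main(final String[] args) throws InterruptedException {
           // hypothetical endpoint where SSL is enabled but the credentials/trust store are wrong
           final Cluster cluster = Cluster.build("localhost").port(8182).enableSsl(true).create();
           final Client client = cluster.connect();
           try {
               // connection pool initialization has already failed, but the driver keeps
               // retrying in the background, so the caller eventually just sees a timeout
               client.submit("g.V().count()").all().get(30, TimeUnit.SECONDS);
           } catch (TimeoutException | ExecutionException e) {
               // nothing here points at the SSL misconfiguration; the real cause is only in
               // driver logs written when initialization first failed
           } catch (RuntimeException re) {
               // depending on the driver version the failure may instead surface as a
               // RuntimeException thrown while borrowing a connection
           } finally {
               cluster.close();
           }
       }
   }
   ```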
   
   Another example of a complication: on `considerHostUnavailable()` we create a
new connection for the reconnect message asynchronously, but right after that we
also destroy all existing connections. Under a race condition we could end up in a
situation where we are adding a connection for the reconnect and then destroying
it because of the earlier logic.
   
   IMO, to keep things simple, if a client is not able to connect to the server
on initialization, it should fail fast with an exception and let the application
layer handle the scenario. The application could choose to recreate the client at
that stage, or fail its startup and check the logs for configuration errors.
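
   For illustration, a minimal sketch of what that fail-fast handling could look like
on the application side (the helper name, retry count and backoff are assumptions,
not part of the driver):
   
   ```java
   import org.apache.tinkerpop.gremlin.driver.Client;
   import org.apache.tinkerpop.gremlin.driver.Cluster;
   
   public final class FailFastStartup {
       // hypothetical helper, not part of the driver: initialize the connection pool
       // eagerly so the real cause (e.g. SSL misconfiguration) surfaces immediately
       static Client connectOrFail(final Cluster cluster, final int maxAttempts) throws InterruptedException {
           RuntimeException last = null;
           for (int attempt = 1; attempt <= maxAttempts; attempt++) {
               final Client client = cluster.connect();
               try {
                   return client.init();              // forces connection pool initialization now
               } catch (RuntimeException re) {
                   last = re;                         // the root cause is visible right here
                   client.close();
                   Thread.sleep(1000L * attempt);     // simple linear backoff between attempts
               }
           }
           if (last != null)
               throw last;                            // fail fast; let the application decide
           throw new IllegalArgumentException("maxAttempts must be positive");
       }
   }
   ```
   
   At startup the application could call such a helper once and either abort with the
real configuration error in hand or rebuild the `Cluster`/`Client` and try again,
rather than relying on background reconnects.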




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Reconnect to server if Java driver fails to initialize
> ------------------------------------------------------
>
>                 Key: TINKERPOP-2569
>                 URL: https://issues.apache.org/jira/browse/TINKERPOP-2569
>             Project: TinkerPop
>          Issue Type: Bug
>          Components: driver
>    Affects Versions: 3.4.11
>            Reporter: Stephen Mallette
>            Priority: Minor
>
> As reported here on SO: 
> https://stackoverflow.com/questions/67586427/how-to-recover-with-a-retry-from-gremlin-nohostavailableexception
> If the host is unavailable at {{Client}} initialization then the host is not 
> put in a state where reconnect is possible. Essentially, this test for 
> {{GremlinServerIntegrateTest}} should pass:
> {code}
>     @Test
>     public void shouldFailOnInitiallyDeadHost() throws Exception {
>         // start test with no server
>         this.stopServer();
>         final Cluster cluster = TestClientFactory.build().create();
>         final Client client = cluster.connect();
>         try {
>             // try to re-issue a request now that the server is down
>             client.submit("g").all().get(3000, TimeUnit.MILLISECONDS);
>             fail("Should throw an exception.");
>         } catch (RuntimeException re) {
>             // Client would have no active connections to the host, hence it would encounter a timeout
>             // trying to find an alive connection to the host.
>             assertThat(re.getCause(), instanceOf(NoHostAvailableException.class));
>             //
>             // should recover when the server comes back
>             //
>             // restart server
>             this.startServer();
>             // try a bunch of times to reconnect. on slower systems this may simply take longer...looking at you travis
>             for (int ix = 1; ix < 11; ix++) {
>                 // the retry interval is 1 second, wait a bit longer
>                 TimeUnit.SECONDS.sleep(5);
>                 try {
>                     final List<Result> results = client.submit("1+1").all().get(3000, TimeUnit.MILLISECONDS);
>                     assertEquals(1, results.size());
>                     assertEquals(2, results.get(0).getInt());
>                 } catch (Exception ex) {
>                     if (ix == 10)
>                         fail("Should have eventually succeeded");
>                 }
>             }
>         } finally {
>             cluster.close();
>         }
>     }
> {code}
> Note that there is a similar test that first allows a connect to a host, then kills 
> it, and then restarts it again, called {{shouldFailOnDeadHost()}}, which demonstrates 
> that reconnection works in that situation.
> I thought it might be an easy fix to simply call {{considerHostUnavailable()}} in the 
> {{ConnectionPool}} constructor in the event of a {{CompletionException}}, which should 
> kickstart the reconnect process. The reconnects started firing, but they all failed 
> for some reason. I didn't have time to investigate further than that.
> Currently the only workaround is to recreate the {{Client}} if this sort of 
> situation occurs.
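
A rough sketch of that workaround (the endpoint, query, timeout and retry handling are
illustrative assumptions, not taken from the issue): catch the failure from the initial
request, close the dead {{Client}} and connect a fresh one from the same {{Cluster}}.

{code}
import java.util.concurrent.TimeUnit;

import org.apache.tinkerpop.gremlin.driver.Client;
import org.apache.tinkerpop.gremlin.driver.Cluster;

public class RecreateClientWorkaround {
    public static void main(final String[] args) throws Exception {
        final Cluster cluster = Cluster.build("localhost").port(8182).create();
        Client client = cluster.connect();
        try {
            client.submit("1+1").all().get(3, TimeUnit.SECONDS);
        } catch (Exception initialFailure) {
            // the host was dead when this Client initialized, so it will not recover on
            // its own; drop it and connect a fresh one from the same Cluster (a real
            // application would wait or poll until the server is reachable again)
            client.close();
            client = cluster.connect();
            client.submit("1+1").all().get(3, TimeUnit.SECONDS);
        } finally {
            cluster.close();
        }
    }
}
{code}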



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
