[ 
https://issues.apache.org/jira/browse/TINKERPOP3-814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

stephen mallette closed TINKERPOP3-814.
---------------------------------------
    Resolution: Fixed

The solution in the description wasn't too far off from the heart of the 
problem in some sense.  First of all, the {{Host}} was getting marked 
unavailable too early for issues that might have been repairable in a sense.  
For example, a serialization problem might occur for one query but not for 
another (e.g. one query might return a Titan {{RelationIdentifier}} which 
requires a serializer and another might not).  These cases where there is a 
chance that future requests can be made on the {{Host}} it's probably not a 
good idea to kill the channel and mark the {{Host}} dead.  So that was resolved 
TINKERPOP3-816

Then, when the {{Host}} was marked unavailable, nothing was being done to clear 
the connections in the pool and then (for some reason) the driver wasn't doing 
much to respect the pool size and new connections were just being created and 
added to the pool when a reconnect to the {{Host}} was attempted.  So, the 
driver now destroys connections when the {{Host}} is marked as unavailable.  On 
reconnect attempts it looks to create a new connection (through standard 
methods that respect pool size/timeouts/etc) and then issues a "ping" to see if 
the host is online.  The "ping" is kinda dumpy, just a gremlin script with an 
empty string, which should just evaluate back as an empty string, but it does 
the job.  Unfortunately, it's not easy rely on the websocket ping given the way 
the driver is setup now, plus we have to rely on the subprotcol capabilities 
since this code will need to work over the NIO channel or other means so doing 
websocket specific things won't be so good.  Perhaps we could build a "ping" 
built into the subprotocol itself.

It would be nice if I had a clean unit test here to prevent regressions, but 
I'm not completely sure how to write it - our integration testing 
infrastructure isn't advanced enough for the scenario involved i don't think.

> ConnectionPool can fill with dead Connections
> ---------------------------------------------
>
>                 Key: TINKERPOP3-814
>                 URL: https://issues.apache.org/jira/browse/TINKERPOP3-814
>             Project: TinkerPop 3
>          Issue Type: Bug
>          Components: driver
>    Affects Versions: 3.0.1-incubating
>         Environment: "Cluster" of a single gremlin-server/Titan node.  Single 
> client.  minConnectionPoolSize = 2; maxConnectionPoolSize = 10.
>            Reporter: Kieran Sherlock
>            Assignee: stephen mallette
>             Fix For: 3.1.0-incubating, 3.0.2-incubating
>
>
> Client gets a GryoMessageSerializerV1d0 decoding error on a couple of 
> requests (different unfiled issue).   ConnectionPool gets into the situation 
> where there are 2 dead connections being maintained in the pool, but no new 
> connections are being issued.
> A sample of the log output...
> {noformat}
> 2015-08-27 09:57:18,122 [Grizzly-worker(15)] DEBUG 
> org.apache.tinkerpop.gremlin.driver.ConnectionPool  - Borrowing connection 
> from pool on Host{address=localhost/127.0.0.1:18182, 
> hostUri=ws://localhost:18182/gremlin} - timeout in 3000 MILLISECONDS
> 2015-08-27 09:57:18,122 [Grizzly-worker(15)] DEBUG 
> org.apache.tinkerpop.gremlin.driver.ConnectionPool  - Return least used 
> Connection{host=Host{address=localhost/127.0.0.1:18182, 
> hostUri=ws://localhost:18182/gremlin}, isDead=false, borrowed=1, pending=0} 
> on Host{address=localhost/127.0.0.1:18182, 
> hostUri=ws://localhost:18182/gremlin}
> 2015-08-27 09:57:18,122 [Grizzly-worker(15)] DEBUG 
> org.apache.tinkerpop.gremlin.driver.Client  - Submitted RequestMessage{, 
> requestId=0349c1ce-19dc-4d54-9ca9-30965c9314ae, op='eval', processor='', 
> args={<redacted>}} to - 
> Connection{host=Host{address=localhost/127.0.0.1:18182, 
> hostUri=ws://localhost:18182/gremlin}}
> 2015-08-27 09:57:18,125 [gremlin-driver-worker-6] DEBUG 
> org.apache.tinkerpop.gremlin.driver.ConnectionPool  - Attempting to return 
> Connection{host=Host{address=localhost/127.0.0.1:18182, 
> hostUri=ws://localhost:18182/gremlin}} on 
> Host{address=localhost/127.0.0.1:18182, hostUri=ws://localhost:18182/gremlin}
> 2015-08-27 09:57:18,126 [gremlin-driver-worker-6] DEBUG 
> org.apache.tinkerpop.gremlin.driver.ConnectionPool  - On 
> Host{address=localhost/127.0.0.1:18182, hostUri=ws://localhost:18182/gremlin} 
> pool size of 3 > minPoolSize 2 and borrowed of 0 <= 
> minSimultaneousUsagePerConnection 8 so destroy 
> Connection{host=Host{address=localhost/127.0.0.1:18182, 
> hostUri=ws://localhost:18182/gremlin}, isDead=false, borrowed=0, pending=0}
> 2015-08-27 09:57:18,126 [gremlin-driver-worker-6] DEBUG 
> org.apache.tinkerpop.gremlin.driver.ConnectionPool  - 
> Connection{host=Host{address=localhost/127.0.0.1:18182, 
> hostUri=ws://localhost:18182/gremlin}, isDead=false, borrowed=0, pending=0} 
> destroyed
> 2015-08-27 09:57:18,126 [Grizzly-worker(15)] DEBUG 
> org.apache.tinkerpop.gremlin.driver.ConnectionPool  - Borrowing connection 
> from pool on Host{address=localhost/127.0.0.1:18182, 
> hostUri=ws://localhost:18182/gremlin} - timeout in 3000 MILLISECONDS
> 2015-08-27 09:57:18,126 [Grizzly-worker(15)] DEBUG 
> org.apache.tinkerpop.gremlin.driver.ConnectionPool  - Pool was initialized 
> but a connection could not be selected earlier - waiting for connection on 
> Host{address=localhost/127.0.0.1:18182, hostUri=ws://localhost:18182/gremlin}
> 2015-08-27 09:57:18,126 [Grizzly-worker(15)] DEBUG 
> org.apache.tinkerpop.gremlin.driver.ConnectionPool  - Wait 3000 MILLISECONDS 
> for an available connection on Host{address=localhost/127.0.0.1:18182, 
> hostUri=ws://localhost:18182/gremlin} with Thread[Grizzly-worker(15),5,main]
> 2015-08-27 09:57:21,127 [Grizzly-worker(15)] DEBUG 
> org.apache.tinkerpop.gremlin.driver.ConnectionPool  - Continue to wait for 
> connection on -1 if {} > 0
> 2015-08-27 09:57:21,127 [Grizzly-worker(15)] DEBUG 
> org.apache.tinkerpop.gremlin.driver.ConnectionPool  - Timed-out waiting for 
> connection on Host{address=localhost/127.0.0.1:18182, 
> hostUri=ws://localhost:18182/gremlin} - possibly unavailable
> 2015-08-27 09:57:21,127 [gremlin-driver-worker-3] DEBUG 
> org.apache.tinkerpop.gremlin.driver.Host  - Trying to reconnect to dead host 
> at Host{address=localhost/127.0.0.1:18182, 
> hostUri=ws://localhost:18182/gremlin}
> 2015-08-27 09:57:21,128 [Grizzly-worker(15)] DEBUG 
> org.apache.tinkerpop.gremlin.driver.Client  - Submitted RequestMessage{, 
> requestId=67761973-7ae7-4a6a-8e62-443fdb55142d, op='eval', processor='', 
> args=<redacted>} to - connection not initialized
> 2015-08-27 09:57:21,128 [gremlin-driver-worker-3] DEBUG 
> org.apache.tinkerpop.gremlin.driver.ConnectionPool  - Trying to re-establish 
> connection on Host{address=localhost/127.0.0.1:18182, 
> hostUri=ws://localhost:18182/gremlin}
> 2015-08-27 09:57:21,128 [Grizzly-worker(15)] ERROR 
> EdnaMetaDataServiceGremlinImpl  - Failed man lookupMerchantAccountName
> java.lang.RuntimeException: java.lang.RuntimeException: 
> java.util.concurrent.TimeoutException
> {noformat}
> This behavior continues with each subsequent request.  The gremlin-server is 
> available at all times.
> As a workaround I have applied the following patch, but I am pretty sure it 
> is not the correct solution
> {noformat}
> diff --git 
> a/gremlin-driver/src/main/java/org/apache/tinkerpop/gremlin/driver/ConnectionPool.java
>  
> b/gremlin-driver/src/main/java/org/apache/tinkerpop/gremlin/driver/ConnectionPool.java
> index 30cd71f..cb8ebfa 100644
> --- 
> a/gremlin-driver/src/main/java/org/apache/tinkerpop/gremlin/driver/ConnectionPool.java
> +++ 
> b/gremlin-driver/src/main/java/org/apache/tinkerpop/gremlin/driver/ConnectionPool.java
> @@ -352,7 +352,7 @@ final class ConnectionPool {
>              }
>              remaining = to - TimeUtil.timeSince(start, unit);
> -            logger.debug("Continue to wait for connection on {} if {} > 0", 
> remaining);
> +            logger.debug("Continue to wait for connection on {} if {} > 0", 
> host, remaining);
>          } while (remaining > 0);
>          logger.debug("Timed-out waiting for connection on {} - possibly 
> unavailable", host);
> @@ -371,13 +371,16 @@ final class ConnectionPool {
>          // let the load-balancer know that the host is acting poorly
>          this.cluster.loadBalancingStrategy().onUnavailable(host);
> -
>      }
>      private boolean tryReconnect(final Host h) {
>          logger.debug("Trying to re-establish connection on {}", host);
>          try {
> +            // remove dead connections
> +            for (Connection c : connections)
> +                if (c.isDead())
> +                    definitelyDestroyConnection(c);
>              connections.add(new Connection(host.getHostUri(), this, cluster, 
> settings().maxInProcessPerConnection));
>              this.open.set(connections.size());
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to