[ https://issues.apache.org/jira/browse/TINKERPOP-2369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17147561#comment-17147561 ]

Divij Vaidya edited comment on TINKERPOP-2369 at 8/4/20, 8:35 PM:
------------------------------------------------------------------

Hello [~carlsej] 

I was looking into this issue, but I wasn't able to reproduce the behaviour 
you described in
??"The current implementation of the driver does not handle this situation 
well, as the Connection whose channel has been closed by the server remains in 
the ConnectionPool. The connection is only reported as dead and replaced when 
it is later chosen by the LoadBalancingStrategy to serve a client request, 
which inevitably fails when the connection attempts to write to the closed 
channel."??

On receiving a CloseWebSocketFrame from the server, the channel *does not* 
remain in the ConnectionPool. Let me explain why. When the server sends a 
CloseWebSocketFrame on an existing connection, the following sequence of 
events happens:
1. The message is handled by the `WebSocketClientHandler` on the client, which 
[closes the 
channel|https://github.com/apache/tinkerpop/blob/HEAD/gremlin-driver/src/main/java/org/apache/tinkerpop/gremlin/driver/handler/WebSocketClientHandler.java#L90].
2. When the channel is closed, the `channelInactive` method of every handler 
is called, including `GremlinResponseHandler`, which [marks all requests in 
flight on that channel as completed 
exceptionally|https://github.com/apache/tinkerpop/blob/HEAD/gremlin-driver/src/main/java/org/apache/tinkerpop/gremlin/driver/Handler.java#L207].
3. That in turn [invokes the exceptional-completion 
callback|https://github.com/apache/tinkerpop/blob/HEAD/gremlin-driver/src/main/java/org/apache/tinkerpop/gremlin/driver/Connection.java#L230]
 on the `Connection`, which removes the connection from the connection pool 
and replaces it.

Hence, the connection should be cleaned up on receiving a CloseWebSocketFrame 
from the server.
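In simplified form, steps 2 and 3 amount to failing every in-flight request 
the moment the channel goes inactive. A minimal sketch of that pattern (the 
class and method names below are illustrative only, not the driver's actual 
code):

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch, not the actual driver implementation: when the channel
// goes inactive, every request still pending on it is completed
// exceptionally, which is what lets the Connection's callback evict the
// connection from the pool.
class PendingRequests {
    private final Map<Long, CompletableFuture<String>> pending = new ConcurrentHashMap<>();

    // Called when a request is written to the channel (hypothetical helper).
    CompletableFuture<String> register(long requestId) {
        CompletableFuture<String> f = new CompletableFuture<>();
        pending.put(requestId, f);
        return f;
    }

    // Analogue of what GremlinResponseHandler does on channelInactive:
    // fail everything in flight so callers are unblocked immediately.
    void onChannelInactive() {
        pending.values().forEach(f ->
            f.completeExceptionally(new IllegalStateException("channel closed by server")));
        pending.clear();
    }
}
```

The real driver does this inside Netty's `channelInactive` callback; the 
sketch only shows the bookkeeping that unblocks pending requests so the pool 
can replace the connection.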

However, there might be a slight delay in cleaning up the connection. Since 
step 3 is asynchronous, it is possible that another submit request on a 
separate worker thread picks up the closed connection before it has been 
removed from the connection pool. In that case, the write to the server would 
fail, and currently [we do not retry on another 
connection|https://github.com/apache/tinkerpop/blob/HEAD/gremlin-driver/src/main/java/org/apache/tinkerpop/gremlin/driver/Client.java#L367]
 when a write to the server fails. This might trigger the behaviour that you 
mentioned.

One change we could make is, on a failed write, to retry first on other 
connections from the same connection pool and then on connections to other 
hosts. That should address your concern. What do you think of this approach?
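A rough sketch of what that retry could look like (all names here are 
hypothetical, not the driver's actual API):

```java
import java.util.List;
import java.util.Optional;

// Hypothetical sketch of the proposed retry: on a failed write, try the
// remaining connections in the same pool first, then the pools of other
// hosts. Not the driver's real classes.
class RetryingSubmitter {
    interface Conn { boolean tryWrite(String msg); }

    // Each inner list represents one host's connection pool; the first list
    // is the host originally chosen by the LoadBalancingStrategy.
    static Optional<Conn> submit(List<List<Conn>> hostPools, String msg) {
        for (List<Conn> pool : hostPools) {
            for (Conn c : pool) {
                if (c.tryWrite(msg)) {
                    return Optional.of(c);   // write succeeded
                }
                // Write failed (e.g. channel already closed by the server):
                // fall through and try the next connection / next host.
            }
        }
        return Optional.empty();             // every connection failed
    }
}
```

The key point is that a single failed write no longer surfaces to the caller 
as long as any connection, on any host, can still accept the request.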


 
 



> Connections in ConnectionPool are not replaced in background when underlying 
> channel is closed
> ----------------------------------------------------------------------------------------------
>
>                 Key: TINKERPOP-2369
>                 URL: https://issues.apache.org/jira/browse/TINKERPOP-2369
>             Project: TinkerPop
>          Issue Type: Bug
>          Components: driver
>    Affects Versions: 3.4.1
>            Reporter: Johannes Carlsen
>            Priority: Major
>
> Hi Tinkerpop team!
>  
> We are using the Gremlin Java Driver to connect to an Amazon Neptune cluster. 
> We are using the IAM authentication feature provided by Neptune, which means 
> that individual websocket connections are closed by the server every 36 
> hours, when their credentials expire. The current implementation of the 
> driver does not handle this situation well, as the Connection whose channel 
> has been closed by the server remains in the ConnectionPool. The connection 
> is only reported as dead and replaced when it is later chosen by the 
> LoadBalancingStrategy to serve a client request, which inevitably fails when 
> the connection attempts to write to the closed channel.
> A fix for this bug would cause the connection pool to be automatically 
> refreshed in the background by either the keep-alive mechanism, which should 
> replace a connection if a keep-alive request fails, or by adding a listener 
> for the close frame being sent to the underlying channel to replace the 
> connection. Without a fix, the only way to recover from a stale connection is 
> to retry the request at the cluster level, which will allow the request to be 
> directed to a different connection.
> I noticed a PR out for the .NET client to fix this behavior: 
> [https://github.com/apache/tinkerpop/pull/1279]. We are hoping for something 
> similar in the Gremlin Java Driver.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
