[ 
https://issues.apache.org/jira/browse/TINKERPOP-2169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16778594#comment-16778594
 ] 

stephen mallette commented on TINKERPOP-2169:
---------------------------------------------

Thanks for all that nice analysis of the problem. Based on what you're saying 
here I think that the first solution you suggested of considering the channel 
dead all together might make the most sense. I say that because there's no way 
to change the {{maxContentLength}} without recreating the {{Cluster}} object 
with the builder. No recovery is really possible given this situation. 

I assume that this is only a problem for the client-side {{maxContentLength}} 
right? if the server side {{maxContentLength}} is exceeded the client gets an 
error but future requests can still work if they are of the right size - we 
have a test that validates just that situation here:

https://github.com/apache/tinkerpop/blob/e81280836feed5df8d6adc8f3a03d09988282ea5/gremlin-server/src/test/java/org/apache/tinkerpop/gremlin/server/GremlinServerIntegrateTest.java#L1066

If that all makes sense to you, do you plan to submit a pull request [~yigitk] ?

> Responses exceeding maxContentLength cause subsequent queries to hang
> ---------------------------------------------------------------------
>
>                 Key: TINKERPOP-2169
>                 URL: https://issues.apache.org/jira/browse/TINKERPOP-2169
>             Project: TinkerPop
>          Issue Type: Bug
>          Components: driver
>    Affects Versions: 3.4.0
>         Environment: Client: EC2 Amazon Linux t3.medium
> Server: Amazon Neptune r4.2xlarge
>            Reporter: Yigit Kiran
>            Priority: Major
>
> Gremlin Driver replaces connections on the channel when it receives 
> Exceptions that are instances of IOException or CodecException (including 
> CorruptedFrameException). When CorruptedFrameException is thrown because 
> response length is greater than the maxContentLength value (32kb by default), 
> driver thinks the host might be unavailable and tries to replace Connection.
> If Connection is shared among multiple requests (its pending queue is > 1), 
> other WSConnection goes stale after connection replacement, while keeping the 
> server executor threads busy.
> Keeping the exec threads busy for stale connections prevents server from 
> picking up new tasks for subsequent requests from the request queue. 
> Additionally since there is a new connection added in Client, it can accept 
> more requests and similar errors can lead to a build up in request queue. 
> When many concurrent requests gets into this situation server become 
> unresponsive to the new requests.
> h3. Steps to repro
>  
> 1. Have a gremlin server
> 2. Connect it using java driver with setting the maxContentLength pretty low, 
> i.e. using the config below:
>  
> {code:java}
> Cluster.Builder builder = Cluster.build();
>         builder.addContactPoint(endpoint[0]);
>         builder.port(8182);
>         builder.maxConnectionPoolSize(100);
>         builder.maxSimultaneousUsagePerConnection(100);
>         builder.maxInProcessPerConnection(50);
>         builder.maxContentLength(32); // <-- this is reduced from 32k 
>         builder.keepAliveInterval(0);
> {code}
>  
> 3. Issue concurrent requests using the cluster, where the response would be 
> greater than 32 bytes.
> h3. Ideas on a possible solution
> One possible solution to this is to not consider channel as dead when request 
> length exceeds maxContentFrame length. 
> {{}}
> {code:java}
> final class Connection {
>     ...
>     public ChannelPromise write(final RequestMessage requestMessage, final 
> CompletableFuture<ResultSet> future) {
>     ...
>         // FIX HERE: Do not consider CorruptedFrameException as 
> non-recoverable exception.
>         if ((t instanceof IOException || t instanceof CodecException) && (! 
> (t instanceof CorruptedFrameException))) {
>         ...
>         }
>     }
> }
> {code}
> Another fix could be the request can be deleted from the Connections' pending 
> request map, and if there are other pending requests on the connection, close 
> them before replacing the connection, or not replace the connection at all: 
>  
> {code:java}
> final class Connection {void replaceConnection(final Connection connection) {
>     ...
>     // FIX HERE: Do not replace connection if there are pending requests on 
> it. 
>     if (!connection.getPending().isEmpty()) {
>         return; // prevent replacing the connection while there are pending 
> requests.
>     }
>     considerNewConnection();
>     definitelyDestroyConnection(connection);
>     }
> }{code}
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to