Yigit Kiran created TINKERPOP-2169:
--------------------------------------

             Summary: Responses exceeding maxContentLength cause subsequent 
queries to hang
                 Key: TINKERPOP-2169
                 URL: https://issues.apache.org/jira/browse/TINKERPOP-2169
             Project: TinkerPop
          Issue Type: Bug
          Components: driver
    Affects Versions: 3.4.0
         Environment: Client: EC2 Amazon Linux t3.medium
Server: Amazon Neptune r4.2xlarge
            Reporter: Yigit Kiran


The Gremlin driver replaces a connection when it receives an exception that is 
an instance of IOException or CodecException (including 
CorruptedFrameException). When a CorruptedFrameException is thrown because the 
response length exceeds the maxContentLength value (32kb by default), the 
driver assumes the host might be unavailable and tries to replace the Connection.

If the Connection is shared among multiple requests (its pending queue holds 
more than one request), the other requests on that WebSocket connection go 
stale after the connection is replaced, while still keeping the server 
executor threads busy.
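
To illustrate the stale-request effect described above, here is a minimal sketch using plain JDK classes. ToyConnection and ToyPool are hypothetical stand-ins, not the driver's real Connection or ConnectionPool classes; the sketch only assumes that each connection tracks its pending requests as futures and that replacement does nothing to those futures.

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for a driver connection shared by several requests.
class ToyConnection {
    // Requests that were written but have not received a response yet.
    final Map<UUID, CompletableFuture<String>> pending = new ConcurrentHashMap<>();

    CompletableFuture<String> submit() {
        CompletableFuture<String> future = new CompletableFuture<>();
        pending.put(UUID.randomUUID(), future);
        return future;
    }
}

// Hypothetical stand-in for a pool that holds one shared connection.
class ToyPool {
    ToyConnection active = new ToyConnection();

    // Swaps in a fresh connection; nothing completes the old pending futures.
    void replaceConnection() {
        active = new ToyConnection();
    }
}

class StaleRequestDemo {
    public static void main(String[] args) {
        ToyPool pool = new ToyPool();
        CompletableFuture<String> oversized = pool.active.submit();
        CompletableFuture<String> bystander = pool.active.submit();

        // The oversized response fails its own request...
        oversized.completeExceptionally(new RuntimeException("frame too large"));
        // ...and the pool reacts by replacing the shared connection.
        pool.replaceConnection();

        System.out.println("oversized done: " + oversized.isDone());
        System.out.println("bystander done: " + bystander.isDone());
    }
}
```

In this model the bystander future is never completed, which is the client-side view of a query that hangs.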

Keeping the executor threads busy for stale connections prevents the server 
from picking up new tasks for subsequent requests from the request queue. 
Additionally, since a new connection has been added in the Client, it can 
accept more requests, and similar errors can lead to a build-up in the request 
queue. When many concurrent requests get into this situation, the server 
becomes unresponsive to new requests.
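
The starvation described above can be modelled with a fixed-size thread pool. This is only a toy sketch under the assumption that the server runs requests on a bounded executor; the class and method names are hypothetical and real Gremlin Server scheduling is more involved.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class ExecutorStarvationDemo {
    // Returns true if a fresh task could not run within 200 ms because
    // stale tasks were occupying the worker threads.
    static boolean newTaskStarved(int workers, int staleTasks) {
        ExecutorService executor = Executors.newFixedThreadPool(workers);
        CountDownLatch release = new CountDownLatch(1);
        try {
            // Each stale task models work for a connection nobody reads anymore.
            for (int i = 0; i < staleTasks; i++) {
                executor.submit(() -> {
                    release.await();
                    return null;
                });
            }
            Future<String> fresh = executor.submit(() -> "result");
            fresh.get(200, TimeUnit.MILLISECONDS);
            return false;
        } catch (TimeoutException e) {
            return true;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            release.countDown();
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("all workers stale: " + newTaskStarved(2, 2));
        System.out.println("one worker free:   " + newTaskStarved(2, 1));
    }
}
```

Once every worker is held by stale work, new tasks only accumulate in the queue, which matches the build-up reported here.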
h3. Steps to repro

 

1. Have a Gremlin Server running.

2. Connect to it using the Java driver with maxContentLength set very low, 
e.g. using the config below:

 
{code:java}
Cluster.Builder builder = Cluster.build();
builder.addContactPoint(endpoint[0]);
builder.port(8182);
builder.maxConnectionPoolSize(100);
builder.maxSimultaneousUsagePerConnection(100);
builder.maxInProcessPerConnection(50);
builder.maxContentLength(32); // <-- this is reduced from 32k
builder.keepAliveInterval(0);
{code}
 

3. Issue concurrent requests using the cluster, where the response would be 
greater than 32 bytes.
h3. Ideas on a possible solution

One possible solution is to not consider the channel dead when the response 
length exceeds maxContentLength. 
{code:java}
final class Connection {
    ...
    public ChannelPromise write(final RequestMessage requestMessage,
                                final CompletableFuture<ResultSet> future) {
        ...

        // FIX HERE: Do not consider CorruptedFrameException as a
        // non-recoverable exception.
        if ((t instanceof IOException || t instanceof CodecException)
                && !(t instanceof CorruptedFrameException)) {
            ...
        }
    }
}
{code}
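
The effect of the changed condition can be checked in isolation. The sketch below is self-contained: the three exception classes are local stand-ins that mirror Netty's hierarchy (CorruptedFrameException extends DecoderException, which extends CodecException), and ChannelErrorPolicy is a hypothetical helper, not part of the driver.

```java
import java.io.IOException;

// Local stand-ins mirroring the Netty exception hierarchy.
class CodecException extends RuntimeException {
    CodecException(String message) { super(message); }
}

class DecoderException extends CodecException {
    DecoderException(String message) { super(message); }
}

class CorruptedFrameException extends DecoderException {
    CorruptedFrameException(String message) { super(message); }
}

// Hypothetical helper that isolates the two versions of the check.
class ChannelErrorPolicy {
    // Condition as described in this report: any IO or codec error marks the channel dead.
    static boolean currentlyTreatedAsDead(Throwable t) {
        return t instanceof IOException || t instanceof CodecException;
    }

    // Proposed condition: an oversized frame no longer marks the channel dead.
    static boolean proposedTreatedAsDead(Throwable t) {
        return (t instanceof IOException || t instanceof CodecException)
                && !(t instanceof CorruptedFrameException);
    }

    public static void main(String[] args) {
        Throwable oversized = new CorruptedFrameException("max frame length exceeded");
        System.out.println("current:  " + currentlyTreatedAsDead(oversized));
        System.out.println("proposed: " + proposedTreatedAsDead(oversized));
    }
}
```

Netty also raises CorruptedFrameException for other framing errors, so a real fix might need a narrower test than the exception type alone.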
Another fix could be to remove the failed request from the Connection's 
pending request map and, if there are other pending requests on the 
connection, either close them before replacing the connection or not replace 
the connection at all: 

 
{code:java}
final class ConnectionPool {
    void replaceConnection(final Connection connection) {
        ...
        // FIX HERE: Do not replace the connection while it has pending requests.
        if (!connection.getPending().isEmpty()) {
            return;
        }

        considerNewConnection();
        definitelyDestroyConnection(connection);
    }
}
{code}
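
The other variant of this idea, closing the pending requests before the connection is replaced, could look roughly like the sketch below. It uses only JDK classes; PendingAwareConnection, PendingAwarePool and failPending are hypothetical names for illustration and do not exist in the driver.

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for a connection that tracks its pending requests.
class PendingAwareConnection {
    final Map<UUID, CompletableFuture<String>> pending = new ConcurrentHashMap<>();

    CompletableFuture<String> submit() {
        CompletableFuture<String> future = new CompletableFuture<>();
        pending.put(UUID.randomUUID(), future);
        return future;
    }

    // Fails every request still waiting on this connection so that no caller hangs.
    void failPending(Throwable cause) {
        for (CompletableFuture<String> future : pending.values()) {
            future.completeExceptionally(cause);
        }
        pending.clear();
    }
}

// Hypothetical stand-in for a pool that holds one shared connection.
class PendingAwarePool {
    PendingAwareConnection active = new PendingAwareConnection();

    // Replaces the connection, but first releases the requests that were riding on it.
    void replaceConnection() {
        PendingAwareConnection old = active;
        active = new PendingAwareConnection();
        old.failPending(new IllegalStateException("connection replaced"));
    }
}

class FailPendingDemo {
    public static void main(String[] args) {
        PendingAwarePool pool = new PendingAwarePool();
        CompletableFuture<String> bystander = pool.active.submit();
        pool.replaceConnection();
        System.out.println("bystander failed: " + bystander.isCompletedExceptionally());
    }
}
```

Failing the bystander requests makes them retryable by the caller, whereas skipping the replacement keeps them alive on the original connection; which trade-off is preferable depends on whether the channel is still usable.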
 

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
