[
https://issues.apache.org/jira/browse/TINKERPOP-2169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16778594#comment-16778594
]
stephen mallette commented on TINKERPOP-2169:
---------------------------------------------
Thanks for all that nice analysis of the problem. Based on what you're saying
here I think that the first solution you suggested of considering the channel
dead all together might make the most sense. I say that because there's no way
to change the {{maxContentLength}} without recreating the {{Cluster}} object
with the builder. No recovery is really possible given this situation.
I assume that this is only a problem for the client-side {{maxContentLength}}
right? if the server side {{maxContentLength}} is exceeded the client gets an
error but future requests can still work if they are of the right size - we
have a test that validates just that situation here:
https://github.com/apache/tinkerpop/blob/e81280836feed5df8d6adc8f3a03d09988282ea5/gremlin-server/src/test/java/org/apache/tinkerpop/gremlin/server/GremlinServerIntegrateTest.java#L1066
If that all makes sense to you, do you plan to submit a pull request [~yigitk] ?
> Responses exceeding maxContentLength cause subsequent queries to hang
> ---------------------------------------------------------------------
>
> Key: TINKERPOP-2169
> URL: https://issues.apache.org/jira/browse/TINKERPOP-2169
> Project: TinkerPop
> Issue Type: Bug
> Components: driver
> Affects Versions: 3.4.0
> Environment: Client: EC2 Amazon Linux t3.medium
> Server: Amazon Neptune r4.2xlarge
> Reporter: Yigit Kiran
> Priority: Major
>
> Gremlin Driver replaces connections on the channel when it receives
> Exceptions that are instances of IOException or CodecException (including
> CorruptedFrameException). When CorruptedFrameException is thrown because
> response length is greater than the maxContentLength value (32kb by default),
> driver thinks the host might be unavailable and tries to replace Connection.
> If Connection is shared among multiple requests (its pending queue is > 1),
> other WSConnection goes stale after connection replacement, while keeping the
> server executor threads busy.
> Keeping the exec threads busy for stale connections prevents server from
> picking up new tasks for subsequent requests from the request queue.
> Additionally since there is a new connection added in Client, it can accept
> more requests and similar errors can lead to a build up in request queue.
> When many concurrent requests gets into this situation server become
> unresponsive to the new requests.
> h3. Steps to repro
>
> 1. Have a gremlin server
> 2. Connect it using java driver with setting the maxContentLength pretty low,
> i.e. using the config below:
>
> {code:java}
> Cluster.Builder builder = Cluster.build();
> builder.addContactPoint(endpoint[0]);
> builder.port(8182);
> builder.maxConnectionPoolSize(100);
> builder.maxSimultaneousUsagePerConnection(100);
> builder.maxInProcessPerConnection(50);
> builder.maxContentLength(32); // <-- this is reduced from 32k
> builder.keepAliveInterval(0);
> {code}
>
> 3. Issue concurrent requests using the cluster, where the response would be
> greater than 32 bytes.
> h3. Ideas on a possible solution
> One possible solution to this is to not consider channel as dead when request
> length exceeds maxContentFrame length.
> {{}}
> {code:java}
> final class Connection {
> ...
> public ChannelPromise write(final RequestMessage requestMessage, final
> CompletableFuture<ResultSet> future) {
> ...
> // FIX HERE: Do not consider CorruptedFrameException as
> non-recoverable exception.
> if ((t instanceof IOException || t instanceof CodecException) && (!
> (t instanceof CorruptedFrameException))) {
> ...
> }
> }
> }
> {code}
> Another fix could be the request can be deleted from the Connections' pending
> request map, and if there are other pending requests on the connection, close
> them before replacing the connection, or not replace the connection at all:
>
> {code:java}
> final class Connection {void replaceConnection(final Connection connection) {
> ...
> // FIX HERE: Do not replace connection if there are pending requests on
> it.
> if (!connection.getPending().isEmpty()) {
> return; // prevent replacing the connection while there are pending
> requests.
> }
> considerNewConnection();
> definitelyDestroyConnection(connection);
> }
> }{code}
>
>
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)