[
https://issues.apache.org/jira/browse/TINKERPOP-2169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16788393#comment-16788393
]
ASF GitHub Bot commented on TINKERPOP-2169:
-------------------------------------------
divijvaidya commented on pull request #1082: TINKERPOP-2169 Responses exceeding
maxContentLength cause subsequent queries to hang
URL: https://github.com/apache/tinkerpop/pull/1082
https://issues.apache.org/jira/browse/TINKERPOP-2169
https://issues.apache.org/jira/browse/TINKERPOP-2173
Similar pull request associated with master branch: #1081
This pull request corresponds to branch tp33.
**Problem**
1. On receiving a CorruptedFrameException, the current code does not
decrement the borrowed count on the connection and marks the connection for
destruction. However, the connection never gets destroyed since the it can only
be destroyed when the borrowed count is 0. This leads to unnecessary
connections waiting in the bin to be cleared.
2. If a connection is serving multiple requests and one of the requests gets
a CorruptedFrameException, Netty closes the underlying channel and thus, other
requests on the same channel receive a ChannelClosed exception. The current
code marks the server host as unavailable (thus closing out other connections
for the host as well) on a ChannelClosed exception whereas that is not
necessarily true. Marking the host as unavailable is too aggressive in this
scenario.
** Changes**
1. Connection.isDead() logic now consists of checking the underlying channel.
2. If a connection is dead, destroy the connection, irrespective of the
number of borrowed items. Informing the inflight (pending) request futures
about this is already done by the response handler.
3. If a connection is returned to the pool and it is dead, replace the
connection. Do not consider the host as unavailable since a single connection
dead does not imply a dead host.
4. Add a warning log whenever a host is marked as unavailable.
5. Replace the connection on ClosedChannel exception.
6. Reset the logging level in the integration tests after each test
execution.
**Testing**
1. Added a test to reproduce the leak (Problem#1). The test fails before the
fixes and passes after the fixes.
2. `mvn verify -DskipIntegrationTests=false` & `mvn verify install` is
successful for gremlin-server & gremlin-driver
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
> Responses exceeding maxContentLength cause subsequent queries to hang
> ---------------------------------------------------------------------
>
> Key: TINKERPOP-2169
> URL: https://issues.apache.org/jira/browse/TINKERPOP-2169
> Project: TinkerPop
> Issue Type: Bug
> Components: driver
> Affects Versions: 3.4.0
> Environment: Client: EC2 Amazon Linux t3.medium
> Server: Amazon Neptune r4.2xlarge
> Reporter: Yigit Kiran
> Priority: Major
>
> Gremlin Driver replaces connections on the channel when it receives
> Exceptions that are instances of IOException or CodecException (including
> CorruptedFrameException). When CorruptedFrameException is thrown because
> response length is greater than the maxContentLength value (32kb by default),
> driver thinks the host might be unavailable and tries to replace Connection.
> If Connection is shared among multiple requests (its pending queue is > 1),
> other WSConnection goes stale after connection replacement, while keeping the
> server executor threads busy.
> Keeping the exec threads busy for stale connections prevents server from
> picking up new tasks for subsequent requests from the request queue.
> Additionally since there is a new connection added in Client, it can accept
> more requests and similar errors can lead to a build up in request queue.
> When many concurrent requests gets into this situation server become
> unresponsive to the new requests.
> h3. Steps to repro
>
> 1. Have a gremlin server
> 2. Connect it using java driver with setting the maxContentLength pretty low,
> i.e. using the config below:
>
> {code:java}
> Cluster.Builder builder = Cluster.build();
> builder.addContactPoint(endpoint[0]);
> builder.port(8182);
> builder.maxConnectionPoolSize(100);
> builder.maxSimultaneousUsagePerConnection(100);
> builder.maxInProcessPerConnection(50);
> builder.maxContentLength(32); // <-- this is reduced from 32k
> builder.keepAliveInterval(0);
> {code}
>
> 3. Issue concurrent requests using the cluster, where the response would be
> greater than 32 bytes.
> h3. Ideas on a possible solution
> One possible solution to this is to not consider channel as dead when request
> length exceeds maxContentFrame length.
> {{}}
> {code:java}
> final class Connection {
> ...
> public ChannelPromise write(final RequestMessage requestMessage, final
> CompletableFuture<ResultSet> future) {
> ...
> // FIX HERE: Do not consider CorruptedFrameException as
> non-recoverable exception.
> if ((t instanceof IOException || t instanceof CodecException) && (!
> (t instanceof CorruptedFrameException))) {
> ...
> }
> }
> }
> {code}
> Another fix could be the request can be deleted from the Connections' pending
> request map, and if there are other pending requests on the connection, close
> them before replacing the connection, or not replace the connection at all:
>
> {code:java}
> final class Connection {void replaceConnection(final Connection connection) {
> ...
> // FIX HERE: Do not replace connection if there are pending requests on
> it.
> if (!connection.getPending().isEmpty()) {
> return; // prevent replacing the connection while there are pending
> requests.
> }
> considerNewConnection();
> definitelyDestroyConnection(connection);
> }
> }{code}
>
>
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)