[
https://issues.apache.org/jira/browse/TINKERPOP-2169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16781642#comment-16781642
]
stephen mallette commented on TINKERPOP-2169:
---------------------------------------------
Again, great analysis - you've identified a number of subtle issues here. Since
you're interested in submitting a pull request, I don't think you should try to
fix all of these things at once. Let's focus on the main issue of this ticket
and then you can create other tickets for the other issues you've noted.
Speaking specifically about this ticket, I think I may have been hasty in
suggesting that we mark the channel as dead. If the user sets a
{{maxContentLength}} and a response exceeds it, that may simply mean they don't
want the "too large" message to go through and that it's ok if it doesn't. The
user should just get an exception for that message and be able to continue
submitting smaller messages that fall under the client-side
{{maxContentLength}}. Does that make sense to you?
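To illustrate the behavior I have in mind, here's a rough client-side sketch -
not driver code, and the query/sizes are purely illustrative:
{code:java}
import java.util.concurrent.CompletionException;
import org.apache.tinkerpop.gremlin.driver.Client;
import org.apache.tinkerpop.gremlin.driver.Cluster;

public class MaxContentLengthBehavior {
    public static void main(String[] args) {
        final Cluster cluster = Cluster.build("localhost").port(8182)
                .maxContentLength(64)     // deliberately tiny for the example
                .create();
        final Client client = cluster.connect();

        try {
            // response is bigger than 64 bytes - this one request should fail...
            client.submit("g.V().valueMap(true)").all().join();
        } catch (CompletionException expected) {
            // ...but only with an exception for this particular message
        }

        // ...and the same client/connection should keep working for smaller responses
        final long count = client.submit("g.V().count()").all().join().get(0).getLong();

        cluster.close();
    }
}
{code}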
As for the new points in your recent comment, I'll just quickly offer some
thoughts (keeping in mind that at this point you might have a better handle on
what's happening in there than I do, since it's been a while since I've worked
with that code in any detail):
1. Regarding the "Connection Borrowing Logic", it does appear as if we're not
decrementing that number in all the cases that we should (i.e. on exceptions).
I don't immediately know what the best fix is, but the rough idea is sketched
after this list.
2. Regarding your third bullet in the comment, I think I resorted to the safety
of marking the host as bad a bit too often and too easily. In retrospect, I
probably felt like it was a clean way to deal with errors because host retry
logic would kick in and reset everything. I probably further felt like that
approach was temporary and that I would return to it later to improve it. That
obviously didn't happen. If you can improve that in any way so that the driver
stays more resilient and we save the expense of retrying the host and
recreating the channel, that would be excellent.
3. Your third link points to {{TraversalOpProcessor}} - don't forget the
script-based processors as well if you choose the server-side approach you
suggested there.
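On point 1, the shape of the fix is probably just to pair every increment of
the borrow counter with a decrement on every failure path. A purely
hypothetical sketch - the class, field and method names here are made up for
illustration and are not the actual driver code:
{code:java}
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

final class BorrowCounterSketch {
    private final AtomicInteger borrowed = new AtomicInteger();

    <T> CompletableFuture<T> borrowAndWrite(final Supplier<CompletableFuture<T>> write) {
        borrowed.incrementAndGet();
        final CompletableFuture<T> f;
        try {
            f = write.get();
        } catch (RuntimeException ex) {
            borrowed.decrementAndGet();   // synchronous failure - release the borrow
            throw ex;
        }
        // asynchronous completion, normal or exceptional, also releases the borrow
        return f.whenComplete((r, t) -> borrowed.decrementAndGet());
    }
}
{code}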
Please target the {{tp33}} branch for your pull request.
Hopefully you can see some nice ways to divide this work into smaller bits
(i.e. other JIRA issues) so that they are easier to evaluate/review in
isolation. Please let me know if you have any other questions. Thanks for
digging so deeply into this btw - not many people poke around this part of the
code too much. Looking forward to seeing what you come up with.
> Responses exceeding maxContentLength cause subsequent queries to hang
> ---------------------------------------------------------------------
>
> Key: TINKERPOP-2169
> URL: https://issues.apache.org/jira/browse/TINKERPOP-2169
> Project: TinkerPop
> Issue Type: Bug
> Components: driver
> Affects Versions: 3.4.0
> Environment: Client: EC2 Amazon Linux t3.medium
> Server: Amazon Neptune r4.2xlarge
> Reporter: Yigit Kiran
> Priority: Major
>
> The Gremlin driver replaces the connection on the channel when it receives
> exceptions that are instances of IOException or CodecException (including
> CorruptedFrameException). When a CorruptedFrameException is thrown because the
> response length is greater than the maxContentLength value (32kb by default),
> the driver assumes the host might be unavailable and tries to replace the
> Connection. If the Connection is shared among multiple requests (its pending
> queue is > 1), those other requests go stale on the old websocket connection
> after the replacement, while keeping the server executor threads busy.
> Keeping the exec threads busy for stale connections prevents the server from
> picking up new tasks for subsequent requests in the request queue.
> Additionally, since a new connection is added to the Client, it can accept
> more requests, and similar errors can lead to a build-up in the request queue.
> When many concurrent requests get into this situation the server becomes
> unresponsive to new requests.
> h3. Steps to repro
>
> 1. Have a Gremlin Server running.
> 2. Connect to it using the Java driver with maxContentLength set very low,
> e.g. using the config below:
>
> {code:java}
> Cluster.Builder builder = Cluster.build();
> builder.addContactPoint(endpoint[0]);
> builder.port(8182);
> builder.maxConnectionPoolSize(100);
> builder.maxSimultaneousUsagePerConnection(100);
> builder.maxInProcessPerConnection(50);
> builder.maxContentLength(32); // <-- this is reduced from 32k
> builder.keepAliveInterval(0);
> {code}
>
> 3. Issue concurrent requests using the cluster, where the responses would be
> greater than 32 bytes.
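> A hypothetical continuation of the config above (the query and request count
> are illustrative only) that reproduces the hang:
>
> {code:java}
> import java.util.ArrayList;
> import java.util.List;
> import java.util.concurrent.CompletableFuture;
> import org.apache.tinkerpop.gremlin.driver.Client;
> import org.apache.tinkerpop.gremlin.driver.Cluster;
> import org.apache.tinkerpop.gremlin.driver.Result;
>
> // continues from the builder configured in step 2
> final Cluster cluster = builder.create();
> final Client client = cluster.connect();
>
> final List<CompletableFuture<List<Result>>> inFlight = new ArrayList<>();
> for (int i = 0; i < 20; i++) {
>     // each response is well over the 32 byte maxContentLength
>     inFlight.add(client.submit("g.V().valueMap(true)").all());
> }
>
> // some of these fail with CorruptedFrameException, the connection is replaced
> // and its remaining pending requests never complete - subsequent queries hang
> client.submit("g.V().count()").all().join();
> {code}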
> h3. Ideas on a possible solution
> One possible solution is to not consider the channel dead when the response
> length exceeds maxContentLength.
> {code:java}
> final class Connection {
>     ...
>     public ChannelPromise write(final RequestMessage requestMessage,
>                                 final CompletableFuture<ResultSet> future) {
>         ...
>         // FIX HERE: Do not consider CorruptedFrameException as a non-recoverable exception.
>         if ((t instanceof IOException || t instanceof CodecException)
>                 && !(t instanceof CorruptedFrameException)) {
>             ...
>         }
>     }
> }
> {code}
> Another fix could be to remove the request from the Connection's pending
> request map and, if there are other pending requests on the connection, close
> them before replacing the connection, or not replace the connection at all:
>
> {code:java}
> final class ConnectionPool {
>     void replaceConnection(final Connection connection) {
>         ...
>         // FIX HERE: Do not replace the connection if there are pending requests on it.
>         if (!connection.getPending().isEmpty()) {
>             return; // prevent replacing the connection while there are pending requests
>         }
>         considerNewConnection();
>         definitelyDestroyConnection(connection);
>     }
> }
> {code}
>
>
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)