[
https://issues.apache.org/jira/browse/TINKERPOP-2390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17322246#comment-17322246
]
Florian Hockmann commented on TINKERPOP-2390:
---------------------------------------------
Thanks for the additional information! I just tried to reproduce the scenario
again. This time I tried both JanusGraph and Gremlin Server with TinkerGraph,
again with a thread pool of 1. Instead of issuing queries that simply sleep
beyond the timeout, I initialized a graph with some data and then executed a
few Gremlin traversals in parallel that each take around 8 seconds to
complete.
I could execute up to 5 of these traversals in parallel without any problems.
But if I tried it with 10 traversals in parallel, then the test didn't
terminate at all. The connections stayed in the state {{ESTABLISHED}} and no
error was logged, but nothing seemed to happen. However, this didn't block the
server at all. I could still execute traversals from another client on the same
server.
The behavior was exactly the same for JanusGraph and for Gremlin Server, so
this is most likely caused by something in Gremlin Server.
To verify whether the Java driver handles this better, I tried the same from
Java, but the behavior was exactly the same. So I still cannot reproduce the
case you described, where the problem is specific to Gremlin.Net and the
server closes connections: in my tests the connections stay open and the
behavior is the same with the Java driver. But I still wonder why the server
seems to be completely stuck on these traversals. Maybe someone with more
knowledge of Gremlin Server could provide some insights here? [~spmallette]
maybe?
----
Additional context for the test scenario:
The traversal I used for testing:
{code:java}
g.V().repeat(both()).times(3).path().limit(100000000).count().next();{code}
and the graph had 100 vertices where each vertex had one outgoing edge to each
vertex in the graph (so 10,000 edges in total).
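As a rough sanity check on why this traversal takes several seconds, the number of paths it has to enumerate can be estimated with plain arithmetic. The sketch below assumes that {{both()}} follows each of a vertex's 100 outgoing and 100 incoming edges, so every hop multiplies the number of traversers by 200. The class and method names are illustrative only and are not part of the test code.

```java
import java.math.BigInteger;

class PathCountEstimate {
    // Traversers after `hops` expansions when starting from every vertex
    // and each expansion fans out to `neighborsPerStep` vertices.
    static BigInteger pathsAfter(int vertices, int neighborsPerStep, int hops) {
        return BigInteger.valueOf(vertices)
                .multiply(BigInteger.valueOf(neighborsPerStep).pow(hops));
    }

    public static void main(String[] args) {
        int vertices = 100;
        int outEdgesPerVertex = 100;                // one edge to each vertex, 10,000 in total
        int bothPerVertex = 2 * outEdgesPerVertex;  // assumption: both() walks out- and in-edges
        BigInteger paths = pathsAfter(vertices, bothPerVertex, 3);
        BigInteger limit = BigInteger.valueOf(100_000_000L);

        System.out.println("paths without limit: " + paths);
        System.out.println("counted after limit: " + paths.min(limit));
    }
}
```

Under that assumption the unrestricted path count exceeds the {{limit(100000000)}}, so the limit is what bounds the work done per traversal.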
I created [a branch|https://github.com/apache/tinkerpop/commits/TINKERPOP-2390]
where I pushed my test code. It's pretty rudimentary but should show what I
tested.
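For anyone who wants to rerun this kind of test without it hanging forever, here is a minimal sketch of a parallel harness with a deadline. It is not the code from the branch: {{ParallelTraversalHarness}}, {{runInParallel}} and the stand-in query are hypothetical, and in a real test the {{Callable}} would submit the traversal through the driver instead of sleeping.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

class ParallelTraversalHarness {
    // Runs `parallelism` copies of `query` concurrently and returns how many
    // completed successfully before the deadline. Queries that are still
    // pending at the deadline are cancelled instead of blocking the test.
    static int runInParallel(Callable<Long> query, int parallelism, long deadlineMillis)
            throws InterruptedException {
        ExecutorService clientThreads = Executors.newFixedThreadPool(parallelism);
        try {
            List<Callable<Long>> tasks = new ArrayList<>();
            for (int i = 0; i < parallelism; i++) {
                tasks.add(query);
            }
            // invokeAll with a timeout cancels every task that has not finished in time.
            List<Future<Long>> futures =
                    clientThreads.invokeAll(tasks, deadlineMillis, TimeUnit.MILLISECONDS);
            int completed = 0;
            for (Future<Long> future : futures) {
                if (future.isCancelled()) {
                    continue; // did not finish before the deadline
                }
                try {
                    future.get();
                    completed++;
                } catch (ExecutionException e) {
                    // the query itself failed; not counted as completed
                }
            }
            return completed;
        } finally {
            clientThreads.shutdownNow();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // Stand-in for a slow traversal (assumption: replace with a real driver call).
        Callable<Long> fakeTraversal = () -> {
            Thread.sleep(50);
            return 1L;
        };
        int completed = runInParallel(fakeTraversal, 10, 5_000);
        System.out.println(completed + " of 10 queries completed before the deadline");
    }
}
```

With a deadline in place, a run that stalls the way the 10-traversal case did would report fewer completed queries than were submitted rather than never terminating.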
> Connections not released when closed abruptly in the server side
> ----------------------------------------------------------------
>
> Key: TINKERPOP-2390
> URL: https://issues.apache.org/jira/browse/TINKERPOP-2390
> Project: TinkerPop
> Issue Type: Bug
> Components: dotnet
> Affects Versions: 3.4.7
> Environment: Tinkerpop 3.4.7 + Janusgraph 0.5.1 (optional opencypher
> 1.0.0)
> Reporter: Carlos
> Priority: Major
>
> We have developed a WService to query a gremlin-server (JanusGraph 0.5.1)
> using the .net driver. Using the opencypher plugin has allowed us to see a
> behaviour where the server gets completely blocked after a timeout on the
> server side. We thought this might be related to issue
> https://issues.apache.org/jira/browse/TINKERPOP-2288, so we have moved our
> driver version to the master one (3.4-dev, which includes the PR solving this
> issue). However, when facing a timeout (server side always, it is the one
> launching the exception), quite a lot of connections get stalled at
> CLOSE_WAIT status, and the server becomes unusable.
> I've been digging around other bugs and issues, and from what I've read, some
> similar behaviour happened with Cosmos DB (although there it might have been
> caused by connection leaks, whereas in this case it is the timeout). We
> have traced down the problem to the driver itself after isolating all the
> components involved (optimizing the cypher query results in a non-timeout
> situation where everything is ok; forcing the timeout from pure gremlin
> replicates the behaviour).
> We have set up the connection pool params to 16 / 4096 (we are expecting
> quite a high concurrency load).