[jira] [Commented] (TINKERPOP-2390) Connections not released when closed abruptly in the server side
[ https://issues.apache.org/jira/browse/TINKERPOP-2390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17322246#comment-17322246 ]

Florian Hockmann commented on TINKERPOP-2390:

Thanks for the additional information! I just tried to reproduce the scenario again. This time I tried both JanusGraph and Gremlin Server with TinkerGraph, again with a thread pool of 1. Instead of issuing queries that simply sleep beyond the timeout, I initialized a graph with some data and then executed a few Gremlin traversals in parallel that each take around 8 seconds to complete.

I could execute up to 5 of these traversals in parallel without any problems. But when I tried it with 10 traversals in parallel, the test didn't terminate at all. The connections stayed in the state {{ESTABLISHED}} and no error was logged, but nothing seemed to happen. However, this didn't block the server at all: I could still execute traversals from another client on the same server. The behavior was exactly the same for JanusGraph and for Gremlin Server, so this is most likely caused by something in Gremlin Server.

To verify whether the Java driver handles this better, I tried the same from Java, but the behavior was exactly the same. So I still cannot reproduce the case you described, where the problem is specific to Gremlin.Net and the server closes connections: in my tests the connections stay open and the behavior is the same with the Java driver.

But I still wonder why the server seems to be completely stuck on these traversals. Maybe someone with more knowledge of Gremlin Server could provide some insights here? [~spmallette] maybe?

Additional context for the test scenario:

The traversal I used for testing:
{code:java}
g.V().repeat(both()).times(3).path().limit(1).count().next();{code}
and the graph had 100 vertices where each vertex had one outgoing edge to each vertex in the graph (so 10,000 edges in total).
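As a quick sanity check of the graph size mentioned above, here is a minimal stdlib-only Java sketch. It does not use TinkerPop and is not taken from the test code on the branch; the class and method names are hypothetical. It assumes that "one outgoing edge to each vertex in the graph" includes a self-loop, since that is what yields 100 x 100 = 10,000 edges.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of the test branch.
class CompleteGraphSketch {

    // Builds adjacency lists in which every vertex has one outgoing edge
    // to every vertex in the graph, itself included (assumption).
    static List<List<Integer>> build(int vertexCount) {
        List<List<Integer>> outEdges = new ArrayList<>();
        for (int from = 0; from < vertexCount; from++) {
            List<Integer> targets = new ArrayList<>();
            for (int to = 0; to < vertexCount; to++) {
                targets.add(to);
            }
            outEdges.add(targets);
        }
        return outEdges;
    }

    // Sums the out-degree of every vertex to get the total edge count.
    static int edgeCount(List<List<Integer>> outEdges) {
        int total = 0;
        for (List<Integer> targets : outEdges) {
            total += targets.size();
        }
        return total;
    }

    public static void main(String[] args) {
        List<List<Integer>> graph = build(100);
        System.out.println("edges: " + edgeCount(graph));
    }
}
```

With such a dense graph, every {{both()}} step fans out to a large number of neighbours, which is consistent with the traversal taking several seconds.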
I created [a branch|https://github.com/apache/tinkerpop/commits/TINKERPOP-2390] where I pushed my test code. It's pretty rudimentary but should show what I tested.

> Connections not released when closed abruptly in the server side
>
> Key: TINKERPOP-2390
> URL: https://issues.apache.org/jira/browse/TINKERPOP-2390
> Project: TinkerPop
> Issue Type: Bug
> Components: dotnet
> Affects Versions: 3.4.7
> Environment: Tinkerpop 3.4.7 + Janusgraph 0.5.1 (optional opencypher 1.0.0)
> Reporter: Carlos
> Priority: Major
>
> We have developed a web service to query a gremlin-server (JanusGraph 0.5.1)
> using the .NET driver. Using the opencypher plugin has allowed us to see a
> behaviour where the server gets completely blocked after a timeout on the
> server side. We thought this might be related to issue
> https://issues.apache.org/jira/browse/TINKERPOP-2288, so we have moved our
> driver version to the master one (3.4-dev, which includes the PR solving this
> issue). However, when facing a timeout (always on the server side, it is the
> one throwing the exception), quite a lot of connections get stalled in
> CLOSE_WAIT status, and the server becomes unusable.
> I've been digging around other bugs and issues, and from what I've read, some
> similar behaviour happened with Cosmos DB (although there it might have been
> caused by connection leaks, whereas in this case it is the timeout). We
> have traced the problem down to the driver itself after isolating all the
> components involved (optimizing the cypher query results in a non-timeout
> situation where everything is ok; forcing the timeout from pure gremlin
> replicates the behaviour).
> We have set up the connection pool params to 16 / 4096 (we are expecting
> quite a high concurrency load).

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Commented] (TINKERPOP-2390) Connections not released when closed abruptly in the server side
[ https://issues.apache.org/jira/browse/TINKERPOP-2390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17321134#comment-17321134 ]

Carlos commented on TINKERPOP-2390:

Hi Florian,

we moved our implementation to the Java driver. However, the scenario where we saw this behaviour was not exactly caused by a timeout, but by overwhelming the server with requests (not that many, in fact). When the underlying TinkerPop provider (JanusGraph in this case) was saturated, it seemed to start closing connections, and the TinkerPop server was not notifying the client properly; that's why we saw a lot of dead connections.

I'm afraid I cannot give more details other than that it was a continuous flow of requests that provoked the behaviour.

Best,
[jira] [Commented] (TINKERPOP-2390) Connections not released when closed abruptly in the server side
[ https://issues.apache.org/jira/browse/TINKERPOP-2390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17321093#comment-17321093 ]

Florian Hockmann commented on TINKERPOP-2390:

I just tried to reproduce the scenario but I don't see anything wrong. Here is what I did:
# Start the server with a {{gremlinPool}} of 1 as described above (_TinkerpopServer configured not to provide any concurrent service (i.e., all the queries were processed sequentially)_).
# Connect from Gremlin.Net (I used the version from current {{master}} and also tried it with the version from {{3.4-dev}}) with default settings ({{PoolSize}} of 4 and {{MaxInProcessPerConnection}} of 32).
# Send 10 requests with a custom evaluation timeout of 1 ms that simply sleep for 3 seconds.
# Result:
## All requests get a {{ResponseException}} with a timeout on the server side.
## 4 connections in state {{ESTABLISHED}} on the server side.
# Send 1 request to verify that both the driver and the server are still in a valid state. -> Receive the expected result.
# Dispose the {{GremlinClient}} instance.
# Result:
## All 4 connections in state {{TIME_WAIT}} on the server.
## After 1 min: connections completely closed.

The server is still responsive after this. As far as my limited knowledge of TCP goes, the {{TIME_WAIT}} state is expected: connections are not closed completely right away in case a delayed or out-of-order packet still arrives. They are closed after a timeout, which seems to be one minute on my machine.

What I really don't understand here is why the server should close the connection just because one request ran into a timeout. That doesn't make much sense, as multiple requests can be processed on the same connection. So the connection shouldn't be affected by a failing request (failing here in the sense of timing out).

[~Bobed] Could you please provide more information on this, ideally a setup to reproduce the problem deterministically? Otherwise, I'm inclined to close this issue as we cannot reproduce it.
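The shape of the steps above (one worker, 10 requests that sleep far longer than their timeout, then one follow-up request) can be modelled with a small stdlib-only Java sketch. This is a toy model under stated assumptions, not the actual test: a single-thread executor stands in for a {{gremlinPool}} of 1, {{Future.get}} with a timeout stands in for the evaluation timeout, and the class and method names are hypothetical. It involves neither Gremlin Server nor any driver, so it says nothing about how the server handles timeouts internally.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Toy model only: a single worker thread stands in for a gremlinPool of 1.
// It does not use Gremlin Server or any TinkerPop driver.
class SingleWorkerTimeoutSketch {

    // Submits `requests` tasks that each sleep for `sleepMillis`, waits at most
    // `timeoutMillis` for each one, and returns how many of them timed out.
    static int countTimeouts(ExecutorService worker, int requests,
                             long sleepMillis, long timeoutMillis)
            throws InterruptedException {
        List<Future<String>> pending = new ArrayList<>();
        for (int i = 0; i < requests; i++) {
            pending.add(worker.submit(() -> {
                Thread.sleep(sleepMillis);
                return "done";
            }));
        }
        int timedOut = 0;
        for (Future<String> future : pending) {
            try {
                future.get(timeoutMillis, TimeUnit.MILLISECONDS);
            } catch (TimeoutException e) {
                // Interrupt the sleeping task so the single worker is freed.
                future.cancel(true);
                timedOut++;
            } catch (ExecutionException e) {
                throw new IllegalStateException(e.getCause());
            }
        }
        return timedOut;
    }

    // Sends one trivial request to check that the worker is still usable.
    static String followUp(ExecutorService worker) throws Exception {
        return worker.submit(() -> "ok").get(1, TimeUnit.SECONDS);
    }

    public static void main(String[] args) throws Exception {
        ExecutorService worker = Executors.newSingleThreadExecutor();
        try {
            // Same numbers as in the steps: 10 requests, 3 s sleep, 1 ms timeout.
            System.out.println("timed out: " + countTimeouts(worker, 10, 3000, 1));
            System.out.println("follow-up: " + followUp(worker));
        } finally {
            worker.shutdownNow();
        }
    }
}
```

In this model a timed-out request only cancels its own task; the worker stays available for the next request, which matches the expectation stated above that a failing request should not affect the connection it ran on.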
[jira] [Commented] (TINKERPOP-2390) Connections not released when closed abruptly in the server side
[ https://issues.apache.org/jira/browse/TINKERPOP-2390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17237258#comment-17237258 ]

Carlos commented on TINKERPOP-2390:

Hi Florian,

> Your REST layer is not sitting between Gremlin.NET and the server, but using Gremlin.NET to provide access to the server for clients via REST, right?

Yes, that's it. The REST API uses Gremlin.NET to pose the queries to JanusGraph.

{quote}and check the nestat of the client machine {quote}
> Here you mean the client of your REST service? And you mean {{netstat}}, right? Could you provide the exact command you used here?

Sorry for the ambiguity: here the client is the machine running the server that provides the REST API (external client => server with the REST API * => JanusGraph). On the *-machine, the command used was netstat ("nestat" was a typo), but I cannot recall exactly which netstat options we used (I assume it was netstat -t tcp, to check the status of all the TCP connections and see why it was stalled).

Finally, yes, it should not be so difficult to reproduce. Exactly as you wrote, posing some queries that will run into a timeout will do the trick. Another detail to bear in mind is that we had the TinkerpopServer configured not to provide any concurrent service (i.e., all the queries were processed sequentially). I forgot to mention this because with the Java driver it didn't seem to make a difference, but it could be relevant for reproducibility purposes.

Best,
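Since the exact netstat invocation is not known, the following is only a sketch of how the connection states could be tallied once some netstat output has been captured as text. It is stdlib-only Java with hypothetical names. It assumes Linux-style rows whose first column starts with "tcp" and whose last column is the state (that is, output captured without a trailing PID/program column), and the sample addresses in {{main}} are made up.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: tallies TCP connection states from netstat-style text.
class NetstatStateTally {

    // Counts rows per state. Assumes the state is the last column and that
    // TCP rows have at least six whitespace-separated columns.
    static Map<String, Integer> tally(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            String[] cols = line.trim().split("\\s+");
            if (cols.length < 6 || !cols[0].startsWith("tcp")) {
                continue; // skip headers, blank lines and non-TCP rows
            }
            counts.merge(cols[cols.length - 1], 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up sample rows; 8182 is used here as the server port.
        List<String> sample = Arrays.asList(
            "Proto Recv-Q Send-Q Local Address Foreign Address State",
            "tcp 0 0 10.0.0.5:43210 10.0.0.9:8182 ESTABLISHED",
            "tcp 0 0 10.0.0.5:43211 10.0.0.9:8182 CLOSE_WAIT",
            "tcp 0 0 10.0.0.5:43212 10.0.0.9:8182 CLOSE_WAIT");
        System.out.println(tally(sample));
    }
}
```

A growing {{CLOSE_WAIT}} count on the machine running the driver would point at sockets the remote side has closed but the local process has not yet released, which is the symptom described in the issue.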
[jira] [Commented] (TINKERPOP-2390) Connections not released when closed abruptly in the server side
[ https://issues.apache.org/jira/browse/TINKERPOP-2390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17237253#comment-17237253 ]

Florian Hockmann commented on TINKERPOP-2390:

Thanks for getting back on this. I'm trying to understand your setup and have a few questions about it:

Your REST layer is not sitting between Gremlin.NET and the server, but using Gremlin.NET to provide access to the server for clients via REST, right? So, the REST facade you mention is only necessary to have the setup completely running, including your REST service and its client. Or is the REST facade actually between Gremlin.NET and the server somehow? (Which would mean that it needs to support WebSockets, though.)

Now about this part:
{quote}and check the nestat of the client machine {quote}
Here you mean the client of your REST service? And you mean {{netstat}}, right? Could you provide the exact command you used here?

If I understood your description correctly, then it should be possible to reproduce this without the REST component by just sending some queries that will run into a timeout.
[jira] [Commented] (TINKERPOP-2390) Connections not released when closed abruptly in the server side
[ https://issues.apache.org/jira/browse/TINKERPOP-2390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17236873#comment-17236873 ]

Carlos commented on TINKERPOP-2390:

On the server side, the only exception we saw was a script timeout exception (we had set the script timeout to 30s, and the query took longer). We built a REST layer on top of the Gremlin.Net driver, and the behaviour we witnessed was that the connections got stuck in TIME_WAIT instead of CLOSE_WAIT (the local side wasn't aware of the fact that the server had closed the connection; this led to a completely blocked driver when several long queries ran into the timeout exception). We had no other choice than restarting the VM (we tried deploying pure VMs and pods).

Regarding the new version, I'm afraid I cannot answer for sure, as we ended up moving to the Java driver. I recall doing some tests with the latest version just before moving to the other driver (we used the code from the branch including that fix) and it still happened, but I cannot state for sure that we tested version 3.4.8.

The behaviour is quite easy to reproduce though: just send several long queries to the driver (setting up a short script timeout on the server side) and check the nestat of the client machine (the setup should be a REST facade, in order to keep the client alive).

We ruled out a problem on the JanusGraph side, as we didn't have this problem with the Java driver (I assume that the Gremlin driver abstracts from the actual underlying provider).

Best,
[jira] [Commented] (TINKERPOP-2390) Connections not released when closed abruptly in the server side
[ https://issues.apache.org/jira/browse/TINKERPOP-2390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17229988#comment-17229988 ]

Florian Hockmann commented on TINKERPOP-2390:

Can you show the exception you see on the server? And what is the behavior that you see in the Gremlin.Net driver at that point?

Also, could you please try whether the issue still occurs in Gremlin.Net 3.4.8 as that includes some changes to the driver that might be relevant here?
[jira] [Commented] (TINKERPOP-2390) Connections not released when closed abruptly in the server side
[ https://issues.apache.org/jira/browse/TINKERPOP-2390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17152504#comment-17152504 ]

Carlos commented on TINKERPOP-2390:

From what I've read so far, it might be related to issue 2369