[jira] [Commented] (TINKERPOP-2390) Connections not released when closed abruptly in the server side

2021-04-15 Thread Florian Hockmann (Jira)


[ 
https://issues.apache.org/jira/browse/TINKERPOP-2390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17322246#comment-17322246
 ] 

Florian Hockmann commented on TINKERPOP-2390:
-

Thanks for the additional information! I just tried to reproduce the scenario 
again. This time I tried both JanusGraph and Gremlin Server with TinkerGraph, 
again with a thread pool of 1. Instead of issuing queries that simply sleep 
beyond the timeout, I initialized a graph with some data and then executed a 
few Gremlin traversals in parallel that each take around 8 seconds to 
complete.

I could execute up to 5 of these traversals in parallel without any problems. 
But if I tried it with 10 traversals in parallel, then the test didn't 
terminate at all. The connections stayed in the state {{ESTABLISHED}} and no 
error was logged, but nothing seemed to happen. However, this didn't block the 
server at all. I could still execute traversals from another client on the same 
server.

The behavior was exactly the same for JanusGraph and for Gremlin Server so this 
is most likely caused by something in Gremlin Server.

To verify whether the Java driver handles this better, I tried the same from 
Java, but the behavior was exactly the same. So I still cannot reproduce the 
case you described, where the problem is specific to Gremlin.Net and the 
server closes connections: in my tests the connections stay open and the 
behavior is the same with the Java driver. But I still wonder why the server 
seems to be completely stuck on these traversals. Maybe someone with more 
knowledge of Gremlin Server could provide some insights here? [~spmallette] 
maybe?

 

Additional context for the test scenario:

The traversal I used for testing:
{code:java}
 g.V().repeat(both()).times(3).path().limit(1).count().next();{code}
and the graph had 100 vertices where each vertex had one outgoing edge to each 
vertex in the graph (so 10,000 edges in total).
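
For readers who want to picture the test graph, here is a minimal sketch in plain Java (no TinkerPop dependency) of the edge set described above. It assumes that "one outgoing edge to each vertex" includes a self-loop, which is what makes 100 vertices yield 10,000 edges. The class and method names are made up for illustration and are not taken from the branch.

```java
import java.util.ArrayList;
import java.util.List;

final class CompleteGraphSketch {
    /**
     * Builds the edge list as (source, target) index pairs: one edge per
     * ordered vertex pair, self-loops included (assumption, see lead-in).
     */
    static List<int[]> buildEdges(int vertexCount) {
        List<int[]> edges = new ArrayList<>(vertexCount * vertexCount);
        for (int from = 0; from < vertexCount; from++) {
            for (int to = 0; to < vertexCount; to++) {
                edges.add(new int[] {from, to});
            }
        }
        return edges;
    }

    public static void main(String[] args) {
        // 100 vertices * 100 targets each = 10000 edges
        System.out.println("edges: " + buildEdges(100).size());
    }
}
```

Every vertex ends up with 100 outgoing and 100 incoming edges, so a traversal that follows edges in both directions for several steps fans out very quickly on this graph.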

I created [a branch|https://github.com/apache/tinkerpop/commits/TINKERPOP-2390] 
where I pushed my test code. It's pretty rudimentary but should show what I 
tested.

> Connections not released when closed abruptly in the server side
> 
>
> Key: TINKERPOP-2390
> URL: https://issues.apache.org/jira/browse/TINKERPOP-2390
> Project: TinkerPop
> Issue Type: Bug
> Components: dotnet
> Affects Versions: 3.4.7
> Environment: Tinkerpop 3.4.7 + Janusgraph 0.5.1 (optional opencypher 1.0.0)
> Reporter: Carlos
> Priority: Major
>
> We have developed a WService to query a gremlin-server (JanusGraph 0.5.1) 
> using the .NET driver. Using the opencypher plugin exposed a behaviour where 
> the server gets completely blocked after a timeout on the server side. We 
> thought this might be related to issue 
> https://issues.apache.org/jira/browse/TINKERPOP-2288, so we moved our 
> driver version to the master one (3.4-dev, which includes the PR solving this 
> issue). However, when facing a timeout (always on the server side, which is 
> the one throwing the exception), quite a lot of connections get stalled in 
> CLOSE_WAIT status, and the server becomes unusable. 
> I've been digging around other bugs and issues, and from what I've read, some 
> similar behaviour happened with Cosmos DB (although there it might have been 
> caused by connection leaks, whereas in this case it is the timeout). We 
> have traced the problem down to the driver itself after isolating all the 
> components involved (optimizing the cypher query results in a non-timeout 
> situation where everything is ok; forcing the timeout from pure Gremlin 
> replicates the behaviour). 
> We have set the connection pool params to 16 / 4096 (we are expecting 
> quite a high concurrency load).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (TINKERPOP-2390) Connections not released when closed abruptly in the server side

2021-04-14 Thread Carlos (Jira)


[ 
https://issues.apache.org/jira/browse/TINKERPOP-2390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17321134#comment-17321134
 ] 

Carlos commented on TINKERPOP-2390:
---

Hi Florian, 

we moved our implementation to the Java driver. However, the scenario where we 
saw this behaviour was not exactly due to a timeout, but due to overwhelming 
the server with requests (not that many, in fact). When the underlying 
TinkerPop provider (JanusGraph in this case) was saturated, it seemed to start 
closing connections, and the TinkerPop server was not properly notifying the 
client; that's why we saw a lot of dead connections. I'm afraid I cannot give 
more details other than that it was a continuous flow of requests that 
provoked the behaviour. 

 

Best, 



[jira] [Commented] (TINKERPOP-2390) Connections not released when closed abruptly in the server side

2021-04-14 Thread Florian Hockmann (Jira)


[ 
https://issues.apache.org/jira/browse/TINKERPOP-2390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17321093#comment-17321093
 ] 

Florian Hockmann commented on TINKERPOP-2390:
-

I just tried to reproduce the scenario but I don't see anything wrong. Here is 
what I did:
 # Start the server with a {{gremlinPool}} of 1 as described above 
(_TinkerpopServer configured not to provide any concurrent service (i.e., all 
the queries were processed sequentially)_).
 # Connect from Gremlin.Net (I used the version from current {{master}} and 
also tried it with the version from {{3.4-dev}}) with default settings 
({{PoolSize}}: 4 and {{MaxInProcessPerConnection}}: 32).
 # Send 10 requests with a custom evaluation timeout of 1 ms that simply sleep 
for 3 seconds.
 # Result:
 ## All requests get a {{ResponseException}} with a timeout on the server side.
 ## 4 connections in state {{ESTABLISHED}} on the server side.
 # Send 1 request to verify that both the driver and the server are still in a 
valid state. -> Receive the expected result.
 # Dispose the {{GremlinClient}} instance.
 # Result:
 ## All 4 connections in state {{TIME_WAIT}} on the server
 ## After 1 min: connections completely closed

The server is still responsive after this. As far as my limited knowledge of 
TCP goes, the {{TIME_WAIT}} state is expected: connections are not closed 
completely right away in case a delayed or out-of-order packet still arrives. 
They are closed after a timeout, which seems to be one minute on my machine.
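
The request pattern from the steps above can be approximated without a server. The following sketch is only an analogy in plain Java: a single-thread executor stands in for a {{gremlinPool}} of 1, each "request" just sleeps, and the timeout is enforced by the caller rather than by Gremlin Server. All names are hypothetical, and none of this is Gremlin.Net or Gremlin Server code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class SingleWorkerTimeoutSketch {
    /** How many requests timed out, and whether the worker still answered afterwards. */
    static final class Outcome {
        final int timedOut;
        final boolean stillResponsive;

        Outcome(int timedOut, boolean stillResponsive) {
            this.timedOut = timedOut;
            this.stillResponsive = stillResponsive;
        }
    }

    static Outcome run(int requests, long sleepMillis, long timeoutMillis) {
        // One worker thread: requests are processed strictly one after another.
        ExecutorService worker = Executors.newSingleThreadExecutor();
        try {
            List<Future<String>> pending = new ArrayList<>();
            for (int i = 0; i < requests; i++) {
                pending.add(worker.submit(() -> {
                    Thread.sleep(sleepMillis); // the "query" only sleeps
                    return "done";
                }));
            }
            int timedOut = 0;
            for (Future<String> request : pending) {
                try {
                    request.get(timeoutMillis, TimeUnit.MILLISECONDS);
                } catch (TimeoutException e) {
                    timedOut++;
                    request.cancel(true); // interrupt the sleeper so the worker is freed
                }
            }
            // Follow-up request: is the single worker still usable?
            Future<Integer> probe = worker.submit(() -> 1 + 1);
            boolean stillResponsive = probe.get(5, TimeUnit.SECONDS) == 2;
            return new Outcome(timedOut, stillResponsive);
        } catch (InterruptedException | ExecutionException | TimeoutException e) {
            throw new IllegalStateException("unexpected failure in sketch", e);
        } finally {
            worker.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // Same numbers as in the steps above: 10 requests, 3 s sleep, 1 ms timeout.
        Outcome outcome = run(10, 3000, 1);
        System.out.println("timed out: " + outcome.timedOut
                + ", still responsive: " + outcome.stillResponsive);
    }
}
```

In this model a timed-out request only affects itself: the worker is released and the follow-up request succeeds, which mirrors the behaviour observed in the test.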

What I really don't understand here is why the server should close the 
connection just because one request ran into a timeout. That doesn't make much 
sense as multiple requests can be processed on the same connection. So, the 
connection shouldn't be affected by a failing request (failing here in the 
sense of timing out).

[~Bobed] Could you please provide more information on this, ideally a setup to 
reproduce the problem deterministically? Otherwise, I'm inclined to close this 
issue as we cannot reproduce it.



[jira] [Commented] (TINKERPOP-2390) Connections not released when closed abruptly in the server side

2020-11-23 Thread Carlos (Jira)


[ 
https://issues.apache.org/jira/browse/TINKERPOP-2390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17237258#comment-17237258
 ] 

Carlos commented on TINKERPOP-2390:
---

Hi Florian, 

> Your REST layer is not sitting between Gremlin.NET and the server, but using 
> Gremlin.NET to provide access to the server for clients via REST, right?

Yes, that's it. The REST API uses Gremlin.NET to pose the queries to 
JanusGraph. 

 
{quote}and check the nestat of the client machine
{quote}
> Here you mean the client of your REST service? And you mean {{netstat}}, 
> right? Could you provide the exact command you used here?

Sorry for the ambiguity. Here the client is the machine running the server 
that provides the REST API (external client => server with the REST API * => 
JanusGraph). On the *-machine, the command used was netstat (typo), but I 
cannot recall exactly which netstat options we used (I assume it was netstat 
-t tcp, to check the status of all the TCP connections and see why it was 
stalled). 
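
To make that kind of check repeatable, the captured netstat output can be summarised per TCP state. The sketch below is plain Java and assumes Linux-style columns (proto, recv-q, send-q, local address, foreign address, state). The sample addresses are invented, and 8182 is used only because it is the default Gremlin Server port. Because the check runs on the machine hosting the REST API, the graph server appears as the foreign address.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class TcpStateCountSketch {
    /** Counts connections per TCP state for lines whose foreign address uses the given port. */
    static Map<String, Integer> countStates(List<String> netstatLines, int remotePort) {
        Map<String, Integer> counts = new TreeMap<>();
        String suffix = ":" + remotePort;
        for (String line : netstatLines) {
            String[] cols = line.trim().split("\\s+");
            // Skip headers, non-TCP rows, and connections to other ports.
            if (cols.length < 6 || !cols[0].startsWith("tcp") || !cols[4].endsWith(suffix)) {
                continue;
            }
            counts.merge(cols[5], 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical capture, not taken from the issue.
        List<String> sample = Arrays.asList(
                "Proto Recv-Q Send-Q Local Address           Foreign Address         State",
                "tcp        0      0 10.0.0.5:51234          10.0.0.9:8182           CLOSE_WAIT",
                "tcp        0      0 10.0.0.5:51236          10.0.0.9:8182           CLOSE_WAIT",
                "tcp        0      0 10.0.0.5:51240          10.0.0.9:8182           ESTABLISHED",
                "tcp        0      0 10.0.0.5:40022          10.0.0.7:22             ESTABLISHED");
        System.out.println(countStates(sample, 8182)); // prints {CLOSE_WAIT=2, ESTABLISHED=1}
    }
}
```

A growing CLOSE_WAIT count on the driver side would indicate that the remote end has closed the connection while the local process has not yet closed its socket.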

 

Finally, yes, it should not be so difficult to reproduce. Exactly as you wrote, 
posing some queries that will run into a timeout will do the trick. 

Another possible detail to bear in mind would be that we had the 
TinkerpopServer configured not to provide any concurrent service (i.e., all the 
queries were processed sequentially). I forgot to mention this as with the Java 
driver it didn't seem to make a difference (but just in case it could be 
relevant for reproducibility purposes). 

 

Best,



[jira] [Commented] (TINKERPOP-2390) Connections not released when closed abruptly in the server side

2020-11-23 Thread Florian Hockmann (Jira)


[ 
https://issues.apache.org/jira/browse/TINKERPOP-2390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17237253#comment-17237253
 ] 

Florian Hockmann commented on TINKERPOP-2390:
-

Thanks for getting back on this. I'm trying to understand your setup and have a 
few questions on that:

Your REST layer is not sitting between Gremlin.NET and the server, but using 
Gremlin.NET to provide access to the server for clients via REST, right? So, 
the REST facade you mention is only necessary to have the setup completely 
running, including your REST service and its client. Or is the REST facade 
actually between Gremlin.NET and the server somehow? (That would mean that it 
needs to support WebSockets, though.)

Now about this part:
{quote}and check the nestat of the client machine
{quote}
Here you mean the client of your REST service? And you mean {{netstat}}, right? 
Could you provide the exact command you used here?

 

If I understood your description correctly, then it should be possible to 
reproduce this without the REST component by just sending some queries that 
will run into a timeout.



[jira] [Commented] (TINKERPOP-2390) Connections not released when closed abruptly in the server side

2020-11-22 Thread Carlos (Jira)


[ 
https://issues.apache.org/jira/browse/TINKERPOP-2390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17236873#comment-17236873
 ] 

Carlos commented on TINKERPOP-2390:
---

On the server side, the only exception we saw was a script timeout exception 
(we had set the script timeout to 30s, and the query ran longer). We built a 
REST layer on top of the Gremlin.Net driver, and the behaviour we witnessed was 
that the connections got stuck in TIME_WAIT instead of CLOSE_WAIT (the local 
side wasn't aware that the server had closed the connection; this led to a 
completely blocked driver when several long queries hit the timeout 
exception). We had no other choice than restarting the VM (we tried deploying 
pure VMs and pods). 

Regarding the new version, I'm afraid I cannot answer for sure as we ended up 
moving to the Java driver. I recall doing some tests with the latest version 
just before moving to the other driver (we used the code from the branch 
including that fix) and it still happened, but I cannot state for sure that we 
tested version 3.4.8. The behaviour is quite easy to reproduce though: just 
send several long queries to the driver (setting up a short script timeout on 
the server side) and check the nestat of the client machine (the setup should 
be a REST facade, in order to keep the client alive). 

We ruled out a problem on the JanusGraph side because we didn't have this 
problem with the Java driver (I assume that the Gremlin driver abstracts from 
the actual underlying provider). 

Best, 



[jira] [Commented] (TINKERPOP-2390) Connections not released when closed abruptly in the server side

2020-11-11 Thread Florian Hockmann (Jira)


[ 
https://issues.apache.org/jira/browse/TINKERPOP-2390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17229988#comment-17229988
 ] 

Florian Hockmann commented on TINKERPOP-2390:
-

Can you show the exception you see on the server? And what is the behavior that 
you see in the Gremlin.Net driver at that point?

Also, could you please check whether the issue still occurs in Gremlin.Net 
3.4.8, as that includes some changes to the driver that might be relevant here?



[jira] [Commented] (TINKERPOP-2390) Connections not released when closed abruptly in the server side

2020-07-07 Thread Carlos (Jira)


[ 
https://issues.apache.org/jira/browse/TINKERPOP-2390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17152504#comment-17152504
 ] 

Carlos commented on TINKERPOP-2390:
---

From what I've read so far, it might be related to issue 2369.
