[
https://issues.apache.org/jira/browse/TINKERPOP-2288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16922392#comment-16922392
]
Florian Hockmann edited comment on TINKERPOP-2288 at 9/4/19 11:19 AM:
----------------------------------------------------------------------
This seems to be a problem that many Cosmos DB users have with Gremlin.Net and
which we thought we had already fixed (see TINKERPOP-2090).
[~pathuot]:
{quote}As a work around, Is there a way we can access this information from the
code so that I can catch those scenario and create logic that re-initiate the
connection pool?
{quote}
You can't directly access this information. What you can do as a workaround
right now is catch the exception and then either retry the request or, if it
fails too often, dispose the {{GremlinClient}} and create a new one, which will
also create new connections.
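A minimal sketch of that workaround, assuming the public Gremlin.Net client API; the endpoint and retry limit here are placeholders, not recommendations:

```csharp
using System.Threading.Tasks;
using Gremlin.Net.Driver;
using Gremlin.Net.Driver.Exceptions;

public class ResilientGremlinClient
{
    // Placeholder endpoint; use your own server/Cosmos DB settings here.
    private readonly GremlinServer _server = new GremlinServer("localhost", 8182);
    private GremlinClient _client;

    public ResilientGremlinClient() => _client = new GremlinClient(_server);

    public async Task<TResult> SubmitWithRetryAsync<TResult>(
        string query, int maxRetries = 3)
    {
        for (var attempt = 0; ; attempt++)
        {
            try
            {
                return await _client.SubmitWithSingleResultAsync<TResult>(query);
            }
            catch (ServerUnavailableException) when (attempt < maxRetries)
            {
                // The pool may only contain dead connections at this point.
                // Dispose the client so a fresh one creates new connections.
                _client.Dispose();
                _client = new GremlinClient(_server);
            }
        }
    }
}
```

Adding a short, growing delay before each retry would additionally give a temporarily unreachable server time to recover.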
[~samimajed]:
{quote}Yes, it does seem like gremlinConnection.Client.NrConnections is the
only information that is accessible to log to a consumer of the library. Is
that correct?
{quote}
That's correct.
{quote}It would be great to have the following data available:
* Current open web socket connections in pool (or remaining available)
* Current number of in-flight requests in whichever web socket that's being
used for my query{quote}
Since we currently don't use these values for anything in the pool, we simply
don't have them. We would need to compute them on the fly by iterating over all
connections. I'm not sure it's a good idea to provide this information when we
don't already have it available, as users are probably not aware of the cost of
computing it. It would also only enable workarounds for a problem that we
should ultimately solve in the driver itself, right?
Another step to make the state of the connection pool more visible to users
would be to add logging to the driver. We could then log, for example, each
time we detect a dead connection or when a connection reaches its max
in-process limit.
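For illustration, such pool-internal logging could look roughly like this. Everything here is hypothetical: the {{Connection}} members and field names are made up, and only the {{ILogger}} calls (from {{Microsoft.Extensions.Logging}}) are real API:

```csharp
// Hypothetical sketch of logging inside the connection pool.
// _logger is a Microsoft.Extensions.Logging.ILogger; the connection
// members used here (IsOpen, NrRequestsInFlight) are illustrative only.
private void LogConnectionState(Connection connection)
{
    if (!connection.IsOpen)
    {
        _logger.LogWarning(
            "Detected a dead connection; it will be removed from the pool.");
    }
    else if (connection.NrRequestsInFlight >= _maxInProcessPerConnection)
    {
        _logger.LogInformation(
            "Connection reached its max in-process limit of {Limit}.",
            _maxInProcessPerConnection);
    }
}
```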
[~jmondal]: Thanks for taking the time to investigate this!
{quote}Users need to catch these and move forward to request new connections
and wait for the connection pool to be ultimately populated.
{quote}
Yes, that's the workaround I would recommend right now for users of Cosmos DB.
{quote}Perhaps Gremlin .NET needs to find a way to contact the server before
throwing a ServerUnavailableException() exception.
{quote}
Yes, Cosmos DB closing idle connections (while apparently ignoring WebSocket
keep-alive pings) is a good reason for us to try replacing closed connections
before throwing an exception. This can of course also help in other scenarios
where the server was just temporarily unreachable but can be reached again when
the request comes in. So, this is something we should implement in general and
not just for the case of Cosmos DB closing idle connections.
I see three alternatives to implement this:
*Option 1*: {{EnsurePoolIsPopulatedAsync()}} iterates over all connections to
check whether they are open and replaces closed ones
* Advantage: Logic to populate the pool is kept in one place
* Disadvantage: Makes each request more expensive as the driver has to iterate
over all connections
*Option 2*: {{TryGetAvailableConnection()}} replaces closed connections directly
* Advantage: The time between check and usage is very short, so a race
condition is unlikely
* Disadvantage: Slows down requests even if another connection is available
for the request
*Option 3*: A background task that regularly checks all connections and
replaces closed ones
* Advantage: Latency of requests is (basically) unaffected
* Advantage: Connections can be replaced whenever the server is available again
* Advantage: We can potentially move the population of the pool and the
removal of closed connections completely out of the normal request processing
into a background task.
* Disadvantage: Highest complexity
* Disadvantage: There is a window between the closing of a connection and the
next check during which requests will still fail. However, this is not much of
a problem for an idle connection, as we can probably replace the dead
connection before new requests arrive.
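Option 3 could be sketched roughly as follows. Everything below, including the {{IConnection}} interface, the monitor type, and its members, is hypothetical illustration of the idea, not existing driver code:

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

// Hypothetical minimal connection abstraction for the sketch.
public interface IConnection : IDisposable
{
    bool IsOpen { get; }
}

// Hypothetical sketch of Option 3: a timer-driven background task
// that periodically replaces closed connections in the pool.
public sealed class ConnectionPoolMonitor : IDisposable
{
    private readonly Timer _timer;
    private readonly List<IConnection> _connections;
    private readonly Func<IConnection> _connectionFactory;
    private readonly object _lock = new object();

    public ConnectionPoolMonitor(List<IConnection> connections,
                                 Func<IConnection> connectionFactory,
                                 TimeSpan checkInterval)
    {
        _connections = connections;
        _connectionFactory = connectionFactory;
        // Fire the check repeatedly at the configured interval.
        _timer = new Timer(_ => ReplaceClosedConnections(), null,
                           checkInterval, checkInterval);
    }

    private void ReplaceClosedConnections()
    {
        // Serializing all resizing through one lock is what prevents the
        // pool from being sized down and up at the same time.
        lock (_lock)
        {
            for (var i = 0; i < _connections.Count; i++)
            {
                if (!_connections[i].IsOpen)
                {
                    _connections[i].Dispose();
                    _connections[i] = _connectionFactory();
                }
            }
        }
    }

    public void Dispose() => _timer.Dispose();
}
```

The check interval would trade off the failure window described above against the cost of the periodic scan.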
Option 1 is the one you suggested, [~jmondal], if I understood it correctly.
Overall, however, I tend toward option 3 as it doesn't impact the latency of
requests and because it allows us to move the pool resizing operations out of
the usual request processing, which also means that we can prevent a situation
where the pool is sized down and up at the same time. [~jorgebg] already
[suggested to use a task
scheduler|https://github.com/apache/tinkerpop/pull/1077#issuecomment-469640764]
for this reason in the PR that introduced round-robin scheduling of connections.
What do others think about this?
edit: [~spmallette] I read your comment after writing this one: TINKERPOP-2215
might make it clearer why the exception is thrown, but it doesn't solve the
problem.
> Get ConnectionPoolBusyException and then ServerUnavailableExceptions
> --------------------------------------------------------------------
>
> Key: TINKERPOP-2288
> URL: https://issues.apache.org/jira/browse/TINKERPOP-2288
> Project: TinkerPop
> Issue Type: Bug
> Components: dotnet
> Affects Versions: 3.4.3
> Environment: Gremlin.Net 3.4.3
> Microsoft.NetCore.App 2.2
> Azure Cosmos DB
> Reporter: patrice huot
> Priority: Critical
>
> I am using the .NET Core Gremlin API to query Cosmos DB.
> From time to time we are getting an error saying that no connection is
> available and then the server becomes unavailable. When this occurs we
> need to restart the server. It looks like the connections are not released
> properly and become unavailable forever.
> We have configured the pool size to 50 and MaxInProcessPerConnection to
> 32 (which I guess should be sufficient).
> To diagnose the issue, is there a way to access diagnostic information on the
> connection pool in order to know how many connections are open and how many
> processes are running in each connection?
> I would like to be able to monitor connection usage to see if the connections
> are about to be exhausted, and to see whether the number of used connections
> keeps increasing or whether the connection lease is released when a query
> completes.
> As a workaround, is there a way we can access this information from the code
> so that I can catch those scenarios and create logic that re-initiates the
> connection pool?
>
>
--
This message was sent by Atlassian Jira
(v8.3.2#803003)