[
https://issues.apache.org/jira/browse/TINKERPOP-2288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16922392#comment-16922392
]
Florian Hockmann edited comment on TINKERPOP-2288 at 9/4/19 11:19 AM:
----------------------------------------------------------------------
This seems to be a problem that many Cosmos DB users have with Gremlin.Net and
which we thought we had already fixed (see TINKERPOP-2090).
[~pathuot]:
{quote}As a work around, Is there a way we can access this information from the
code so that I can catch those scenario and create logic that re-initiate the
connection pool?
{quote}
You can't directly access this information. What you can do as a workaround
right now is catch the exception and then either retry the request or, if it
fails too often, dispose the {{GremlinClient}} and create a new one, which will
also create new connections.
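A minimal sketch of that workaround, assuming the public Gremlin.Net client API; the endpoint and retry limit here are placeholders, not recommendations:

```csharp
using System.Threading.Tasks;
using Gremlin.Net.Driver;
using Gremlin.Net.Driver.Exceptions;

public class ResilientGremlinClient
{
    // Placeholder endpoint; use your own server/Cosmos DB settings here.
    private readonly GremlinServer _server = new GremlinServer("localhost", 8182);
    private GremlinClient _client;

    public ResilientGremlinClient() => _client = new GremlinClient(_server);

    public async Task<TResult> SubmitWithRetryAsync<TResult>(
        string query, int maxRetries = 3)
    {
        for (var attempt = 0; ; attempt++)
        {
            try
            {
                return await _client.SubmitWithSingleResultAsync<TResult>(query);
            }
            catch (ServerUnavailableException) when (attempt < maxRetries)
            {
                // The pool may only contain dead connections at this point.
                // Dispose the client so a fresh one creates new connections.
                _client.Dispose();
                _client = new GremlinClient(_server);
            }
        }
    }
}
```

Adding a short, growing delay before each retry would additionally give a temporarily unreachable server time to recover.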
[~samimajed]:
{quote}Yes, it does seem like gremlinConnection.Client.NrConnections is the
only information that is accessible to log to a consumer of the library. Is
that correct?
{quote}
That's correct.
{quote}It would be great to have the following data available:
* Current open web socket connections in pool (or remaining available)
* Current number of in-flight requests in whichever web socket that's being
used for my query{quote}
Since we currently don't use these values for anything in the pool, we simply
don't have them. We would need to compute them on the fly by iterating over all
connections. I'm not sure it's a good idea to provide this information when we
don't already have it available, as users are probably not aware of the cost of
computing it. It would also only enable workarounds for a problem that we
should ultimately solve in the driver itself, right?
Another step to make the state of the connection pool more visible to users
would be to add logging to the driver. We could then log, for example, each
time we detect a dead connection or when a connection reaches its max
in-process limit.
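For illustration, such pool-internal logging could look roughly like this. Everything here is hypothetical: the {{Connection}} members and field names are made up, and only the {{ILogger}} calls (from {{Microsoft.Extensions.Logging}}) are real API:

```csharp
// Hypothetical sketch of logging inside the connection pool.
// _logger is a Microsoft.Extensions.Logging.ILogger; the connection
// members used here (IsOpen, NrRequestsInFlight) are illustrative only.
private void LogConnectionState(Connection connection)
{
    if (!connection.IsOpen)
    {
        _logger.LogWarning(
            "Detected a dead connection; it will be removed from the pool.");
    }
    else if (connection.NrRequestsInFlight >= _maxInProcessPerConnection)
    {
        _logger.LogInformation(
            "Connection reached its max in-process limit of {Limit}.",
            _maxInProcessPerConnection);
    }
}
```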
[~jmondal]: Thanks for taking the time to investigate this!
{quote}Users need to catch these and move forward to request new connections
and wait for the connection pool to be ultimately populated.
{quote}
Yes, that's the workaround I would recommend right now for users of Cosmos DB.
{quote}Perhaps Gremlin .NET needs to find a way to contact the server before
throwing a ServerUnavailableException() exception.
{quote}
Yes, Cosmos DB closing idle connections (while apparently ignoring WebSocket
keep-alive pings) is a good reason for us to try replacing closed connections
before throwing an exception. This can of course also help in other scenarios
where the server was just temporarily unreachable but can be reached again when
the request comes in. So, this is something we should implement in general and
not just for the case of Cosmos DB closing idle connections.
I see three alternatives to implement this:
*Option 1*: {{EnsurePoolIsPopulatedAsync()}} iterates over all connections to
check whether they are open and replaces closed ones
* Advantage: Logic to populate the pool is kept in one place
* Disadvantage: Makes each request more expensive as the driver has to iterate
over all connections
*Option 2*: {{TryGetAvailableConnection()}} replaces closed connections directly
* Advantage: The time between check and usage is very short, so a race
condition is unlikely
* Disadvantage: Slows down requests even if another connection is available
for the request
*Option 3*: A background task that regularly checks all connections and
replaces closed ones
* Advantage: Latency of requests is (basically) unaffected
* Advantage: Connections can be replaced whenever the server is available again
* Advantage: We can potentially move the population of the pool and the
removal of closed connections completely out of the normal request processing
into a background task.
* Disadvantage: Highest complexity
* Disadvantage: There is a window between the closing of a connection and the
next check during which requests will still fail. However, this is not much of
a problem for an idle connection, as we can probably replace the dead
connection before new requests arrive.
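Option 3 could be sketched roughly as follows. Everything below, including the {{IConnection}} interface, the monitor type, and its members, is hypothetical illustration of the idea, not existing driver code:

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

// Hypothetical minimal connection abstraction for the sketch.
public interface IConnection : IDisposable
{
    bool IsOpen { get; }
}

// Hypothetical sketch of Option 3: a timer-driven background task
// that periodically replaces closed connections in the pool.
public sealed class ConnectionPoolMonitor : IDisposable
{
    private readonly Timer _timer;
    private readonly List<IConnection> _connections;
    private readonly Func<IConnection> _connectionFactory;
    private readonly object _lock = new object();

    public ConnectionPoolMonitor(List<IConnection> connections,
                                 Func<IConnection> connectionFactory,
                                 TimeSpan checkInterval)
    {
        _connections = connections;
        _connectionFactory = connectionFactory;
        // Fire the check repeatedly at the configured interval.
        _timer = new Timer(_ => ReplaceClosedConnections(), null,
                           checkInterval, checkInterval);
    }

    private void ReplaceClosedConnections()
    {
        // Serializing all resizing through one lock is what prevents the
        // pool from being sized down and up at the same time.
        lock (_lock)
        {
            for (var i = 0; i < _connections.Count; i++)
            {
                if (!_connections[i].IsOpen)
                {
                    _connections[i].Dispose();
                    _connections[i] = _connectionFactory();
                }
            }
        }
    }

    public void Dispose() => _timer.Dispose();
}
```

The check interval would trade off the failure window described above against the cost of the periodic scan.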
Option 1 is the one you suggested, [~jmondal], if I understood it correctly.
Overall, however, I tend toward option 3 as it doesn't impact the latency of
requests and because it allows us to move the pool resizing operations out of
the usual request processing, which also means that we can prevent a situation
where the pool is sized down and up at the same time. [~jorgebg] already
[suggested to use a task
scheduler|https://github.com/apache/tinkerpop/pull/1077#issuecomment-469640764]
for this reason in the PR that introduced round-robin scheduling of connections.
What do others think about this?
edit: [~spmallette] I read your comment after writing this one: TINKERPOP-2215
might make it clearer why the exception is thrown, but it doesn't solve the
problem.
> Get ConnectionPoolBusyException and then ServerUnavailableExceptions
> --------------------------------------------------------------------
>
> Key: TINKERPOP-2288
> URL: https://issues.apache.org/jira/browse/TINKERPOP-2288
> Project: TinkerPop
> Issue Type: Bug
> Components: dotnet
> Affects Versions: 3.4.3
> Environment: Gremlin.Net 3.4.3
> Microsoft.NetCore.App 2.2
> Azure Cosmos DB
> Reporter: patrice huot
> Priority: Critical
>
> I am using the .NET Core Gremlin API to query Cosmos DB.
> From time to time we are getting an error saying that no connection is
> available and then the server becomes unavailable. When this occurs we
> need to restart the server. It looks like the connections are not released
> properly and become unavailable forever.
> We have configured the pool size to 50 and MaxInProcessPerConnection to
> 32 (which I guess should be sufficient).
> To diagnose the issue, is there a way to access diagnostic information on the
> connection pool in order to know how many connections are open and how many
> processes are running in each connection?
> I would like to be able to monitor connection usage to see if the connections
> are about to be exhausted, and to see whether the number of used connections
> keeps increasing or whether the connection lease is released when a query
> completes.
> As a workaround, is there a way we can access this information from the code
> so that I can catch those scenarios and create logic that re-initiates the
> connection pool?
>
>
--
This message was sent by Atlassian Jira
(v8.3.2#803003)