[
https://issues.apache.org/jira/browse/TINKERPOP-2573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17356410#comment-17356410
]
Florian Hockmann commented on TINKERPOP-2573:
---------------------------------------------
Thanks for reporting this and giving your thoughts on how we could improve
this, [~olivertowers]!
{quote}Gremlin server configurations can include a load balancer in front of
the server nodes (Azure Cosmos DB is one example).
{quote}
That is definitely something that we haven't accounted for enough. The driver
was originally built with the idea that a {{ConnectionPool}} holds connections
to a single server. TINKERPOP-1765 should add support for multiple servers and
then each server could have its own {{ConnectionPool}}. That's also how the
Java driver operates if I'm not mistaken. But that of course doesn't help if
multiple servers are behind a load balancer, a setup that is becoming more and
more common in cloud environments or with containers in general.
{quote}1. Modify {{FillPoolAsync}} to handle connection failures individually,
doing 'best-effort' replacement of missing connections.
2. If {{FillPoolAsync}} observes an exception, still throw but do not dispose
of connection tasks that were successful (and check the state of those
successful connections and dispose if they are in error?).
{quote}
Aren't these two points basically the same? We could solve this by iterating
over all connection tasks in the {{catch}} block, adding all connections from
successfully finished tasks (maybe after also checking their connection state)
to the pool and then throwing the exception.
This would of course leave the pool in a state where it doesn't have its
configured size, but the missing connections can still be added later, e.g.,
through the retry policy, as you already mentioned.
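Concretely, the catch-block salvage could look roughly like this (a sketch only; the {{Connection}} stub, {{CreateNewConnectionAsync}}, and {{_connections}} only approximate the driver's actual internals):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

// Illustration only: minimal stub standing in for the driver's Connection type.
public sealed class Connection : IDisposable
{
    public bool IsOpen { get; private set; } = true;
    public void Dispose() => IsOpen = false;
}

public class ConnectionPoolSketch
{
    private readonly List<Connection> _connections = new List<Connection>();

    // Hypothetical stand-in for the real connection factory.
    private Task<Connection> CreateNewConnectionAsync() =>
        Task.FromResult(new Connection());

    private async Task FillPoolAsync(int nrConnectionsToCreate)
    {
        var tasks = new List<Task<Connection>>(nrConnectionsToCreate);
        for (var i = 0; i < nrConnectionsToCreate; i++)
            tasks.Add(CreateNewConnectionAsync());

        try
        {
            var connections = await Task.WhenAll(tasks).ConfigureAwait(false);
            foreach (var connection in connections)
                _connections.Add(connection);
        }
        catch (Exception)
        {
            // Salvage connections from the tasks that did succeed instead of
            // disposing them; only the failed creations are lost.
            foreach (var task in tasks.Where(t => t.IsCompletedSuccessfully))
            {
                if (task.Result.IsOpen)         // optionally re-check the state
                    _connections.Add(task.Result);
                else
                    task.Result.Dispose();
            }
            throw;
        }
    }
}
```

The caller still sees the exception, but the pool keeps whatever connections were established, so a later refill only has to replace the delta.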
{quote}3. In {{ConnectionPool}} constructor, apply reconnect retry policy to
{{ReplaceDeadConnectionAsync()}} invocation.
{quote}
I'm not sure about this one yet. If the server is already unavailable when the
user creates a {{GremlinClient}}, then I think that it's best to fail fast to
inform the user about the problem. I think it's more likely in this case that
the user supplied invalid connection parameters or that the server is
completely unreachable (e.g., due to a missing firewall exception) than a
transient error where a retry policy could help.
I would also like to avoid a constructor call that can take a long time.
Creating the connections in the constructor of course already takes some time
and goes beyond what a constructor should usually do. Maybe we could move this
logic into some {{ConnectAsync}} method in the future. That method could be
a static factory method and we could then make the constructor private, but
that would of course be a breaking change so it would have to wait for 3.6.0.
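For reference, the factory idea could take roughly this shape (hypothetical, not the current API; {{FillPoolAsync}} stands in for whatever actually creates the connections today):

```csharp
using System.Threading.Tasks;

public class GremlinClient
{
    // Constructor becomes private so that no slow I/O happens in it.
    private GremlinClient() { }

    // Hypothetical static factory: all potentially slow I/O happens here,
    // and a retry policy could wrap this call without affecting construction.
    public static async Task<GremlinClient> ConnectAsync(/* server, settings, ... */)
    {
        var client = new GremlinClient();
        await client.FillPoolAsync().ConfigureAwait(false);
        return client;
    }

    private Task FillPoolAsync() => Task.CompletedTask; // placeholder
}
```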
These are the two reasons I see against a retry policy in the constructor, but
I'm also not completely against this. Do you frequently see the situation where
calling the constructor throws due to a transient error or maybe an error with
just one of many servers behind a load balancer?
{quote}To allow the underlying errors to be observed by callers, a
callback/event handler could be added that is invoked when
{{ReplaceClosedConnectionsAsync}} throws. This would allow any exceptions to
be observed and collected in some manner.
{quote}
That's true. My take so far was that we should add logging to the driver
(TINKERPOP-2471) to make things like these transparent to the user. Do you
think that callbacks / an event handler would be a better approach than
logging? I think logging would be a lot easier to use as users would only have
to provide a logger.
{quote}Further refinement of this: Have a way of populating
{{ServerUnavailableException}} inner exception with the last observed
replacement exception.
{quote}
Interesting idea. We currently don't do that as the connection replacement is
handled asynchronously in the background by a different task, but we could of
course save the last thrown exception and use that as the inner exception.
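A sketch of that idea (assuming a {{ServerUnavailableException}} overload that accepts an inner exception, which would have to be added; the other names only approximate the driver's internals):

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public class ConnectionPoolSketch
{
    // Last failure observed by the background replacement task.
    private Exception _lastReplacementException;

    private async Task ReplaceClosedConnectionsAsync()
    {
        try
        {
            await FillPoolAsync().ConfigureAwait(false);
        }
        catch (Exception e)
        {
            // Remember the failure so a later ServerUnavailableException can
            // carry it as its inner exception.
            Interlocked.Exchange(ref _lastReplacementException, e);
            throw;
        }
    }

    private void ThrowServerUnavailable() =>
        throw new ServerUnavailableException(
            "Failed to acquire an open connection from the pool.",
            Volatile.Read(ref _lastReplacementException));

    private Task FillPoolAsync() => Task.CompletedTask; // placeholder
}

// Assumed overload; the existing exception type would need a constructor
// that accepts an inner exception.
public class ServerUnavailableException : Exception
{
    public ServerUnavailableException(string message, Exception inner)
        : base(message, inner) { }
}
```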
> Gremlin.NET: Retry on CreateNewConnectionAsync() failures and avoid disposing
> successful connections on a failure
> -----------------------------------------------------------------------------------------------------------------
>
> Key: TINKERPOP-2573
> URL: https://issues.apache.org/jira/browse/TINKERPOP-2573
> Project: TinkerPop
> Issue Type: Improvement
> Components: dotnet
> Affects Versions: 3.4.11
> Reporter: Oliver Towers
> Priority: Minor
>
> [FillPoolAsync|https://github.com/apache/tinkerpop/blob/dc4a5d8795e836a35ce4b61247b9693fa0cea8cb/gremlin-dotnet/src/Gremlin.Net/Driver/ConnectionPool.cs#L119]
> attempts to create connections to fill the pool up to the target
> {{PoolSize}}, issuing {{CreateNewConnectionAsync}} to build out or recover
> the delta.
> If one of these tasks faults (e.g., due to a socket error during the
> websocket handshake), then all successful connections are disposed and an
> exception is thrown, which can result in either:
> * {{ConnectionPool}} constructor throws, so the only option is to retry
> creating the client.
> * A fire-and-forget task for {{ReplaceDeadConnectionAsync}} fails and the
> connection pool cannot recover the pool size at that time. The
> {{GetAvailableConnectionAsync}} retry policy can kick-in here and the pool
> may eventually recover.
> Gremlin server configurations can include a load balancer in front of the
> server nodes (Azure Cosmos DB is one example). So there are cases where
> faults with new connections are transient, and successful/healthy
> connections could be kept instead of being destroyed.
> Given this the proposal is the following:
> # Modify {{FillPoolAsync}} to handle connection failures individually, doing
> 'best-effort' replacement of missing connections.
> # If {{FillPoolAsync}} observes an exception, still throw but do not dispose
> of connection tasks that were successful (and check the state of those
> successful connections and dispose if they are in error?).
> # In {{ConnectionPool}} constructor, apply reconnect retry policy to
> {{ReplaceDeadConnectionAsync()}} invocation
> ## Retry on any Exception (since there are various IO/Socket exceptions that
> can be thrown here). This maintains the previous behavior where the pool must
> be filled in-full at construction, and throws the underlying exception to the
> caller.
> *Additional note:*
> Post construction, if {{GetAvailableConnection}} triggers a connection
> replacement fire-and-forget task where the task encounters network
> IO/handshake error, then these exceptions cannot be observed directly by the
> caller. Only a {{ServerUnavailableException}} is thrown if the retries are
> exhausted.
> To allow the underlying errors to be observed by callers, a callback/event
> handler could be added that is invoked when
> {{ReplaceClosedConnectionsAsync}} throws. This would allow any exceptions to
> be observed and collected in some manner.
> Further refinement of this: Have a way of populating
> {{ServerUnavailableException}} inner exception with the last observed
> replacement exception.