Oliver Towers created TINKERPOP-2573:
----------------------------------------
Summary: Gremlin.NET: Retry on CreateNewConnectionAsync() failures
and avoid disposing successful connections on a failure
Key: TINKERPOP-2573
URL: https://issues.apache.org/jira/browse/TINKERPOP-2573
Project: TinkerPop
Issue Type: Improvement
Components: dotnet
Affects Versions: 3.4.11
Reporter: Oliver Towers
[FillPoolAsync|https://github.com/apache/tinkerpop/blob/dc4a5d8795e836a35ce4b61247b9693fa0cea8cb/gremlin-dotnet/src/Gremlin.Net/Driver/ConnectionPool.cs#L119]
attempts to create connections to fill the pool up to the target {{PoolSize}},
issuing {{CreateNewConnectionAsync}} to build-out or recover the delta.
If one of these tasks fault (eg. due to some socket error during websocket
handshake), then all successful connections are disposed and an exception is
thrown, which can result in either:
* {{ConnectionPool}} constructor throws, so the only option is to retry
creating the client.
* A fire-and-forget task for {{ReplaceDeadConnectionAsync}} fails and the
connection pool cannot recover the pool size at that time. The
{{GetAvailableConnectionAsync}} retry policy can kick-in here and the pool may
eventually recover.
Gremlin server configurations can include a load-balancer in-front of the
server nodes (Azure Cosmos DB is one example). So there are cases where faults
with new connections, are transient, but successful/healthy connection attempts
can be maintained without destroying them.
Given this the proposal is the following:
# Modify {{FillPoolAsync}} to handle connection failures individually, doing
'best-effort' replacement of missing connections.
# If {{FillPoolAsync}} observes an exception, still throw but do not dispose
of connection tasks that were successful (and check that state of those
successful connections and dispose if they are in error?).
# In {{ConnectionPool}} constructor, apply reconnect retry policy to
{{ReplaceDeadConnectionAsync()}} invocation
## Retry on any Exception (since there are various IO/Socket exceptions that
can be thrown here). This maintains the previous behavior where the pool must
be filled in-full at construction, and throws the underlying exception to the
caller.
*Additional note:*
Post construction, if {{GetAvailableConnection}} triggers a connection
replacement fire-and-forget task where the task encounters network IO/handshake
error, then these exceptions cannot be observed directly by the caller. Only a
{{ServerUnavailableException}} is thrown if the retries are exhausted.
To allow the underlying errors to be observed by callers, a callback/event
handler could be added that is invoked in the {{ReplaceClosedConnectionsAsync}}
throws. This would allow for the any exceptions to be observed and collected in
some manner.
Further refinement of this: Have a way of populating
{{ServerUnavailableException}} inner exception with the last observed
replacement exception.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)