Oliver Towers created TINKERPOP-2573:
----------------------------------------

             Summary: Gremlin.NET: Retry on CreateNewConnectionAsync() failures 
and avoid disposing successful connections on a failure
                 Key: TINKERPOP-2573
                 URL: https://issues.apache.org/jira/browse/TINKERPOP-2573
             Project: TinkerPop
          Issue Type: Improvement
          Components: dotnet
    Affects Versions: 3.4.11
            Reporter: Oliver Towers


[FillPoolAsync|https://github.com/apache/tinkerpop/blob/dc4a5d8795e836a35ce4b61247b9693fa0cea8cb/gremlin-dotnet/src/Gremlin.Net/Driver/ConnectionPool.cs#L119]
 attempts to create connections to fill the pool up to the target {{PoolSize}}, 
issuing {{CreateNewConnectionAsync}} to build-out or recover the delta.

If one of these tasks fault (eg. due to some socket error during websocket 
handshake), then all successful connections are disposed and an exception is 
thrown, which can result in either:
 * {{ConnectionPool}} constructor throws, so the only option is to retry 
creating the client.
 * A fire-and-forget task for {{ReplaceDeadConnectionAsync}} fails and the 
connection pool cannot recover the pool size at that time. The 
{{GetAvailableConnectionAsync}} retry policy can kick-in here and the pool may 
eventually recover.

Gremlin server configurations can include a load-balancer in-front of the 
server nodes (Azure Cosmos DB is one example). So there are cases where faults 
with new connections, are transient, but successful/healthy connection attempts 
can be maintained without destroying them.

Given this the proposal is the following:
 # Modify {{FillPoolAsync}} to handle connection failures individually, doing 
'best-effort' replacement of missing connections.
 # If {{FillPoolAsync}} observes an exception, still throw but do not dispose 
of connection tasks that were successful (and check that state of those 
successful connections and dispose if they are in error?).
 # In {{ConnectionPool}} constructor, apply reconnect retry policy to 
{{ReplaceDeadConnectionAsync()}} invocation
 ## Retry on any Exception (since there are various IO/Socket exceptions that 
can be thrown here). This maintains the previous behavior where the pool must 
be filled in-full at construction, and throws the underlying exception to the 
caller.

*Additional note:*
Post construction, if {{GetAvailableConnection}} triggers a connection 
replacement fire-and-forget task where the task encounters network IO/handshake 
error, then these exceptions cannot be observed directly by the caller. Only a 
{{ServerUnavailableException}} is thrown if the retries are exhausted.

To allow the underlying errors to be observed by callers, a callback/event 
handler could be added that is invoked in the {{ReplaceClosedConnectionsAsync}} 
throws. This would allow for the any exceptions to be observed and collected in 
some manner.

Further refinement of this: Have a way of populating 
{{ServerUnavailableException}} inner exception with the last observed 
replacement exception.




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to