On 2025-08-06 04:29, Ryan Schmitt wrote:
Since rolling out 5.5 (which includes my change for
`validateAfterInactivity`, which was previously interpreted as a connection
TTL for async HTTP/1.1 connections), I've been getting reports of
intermittent request failures which appear to be caused by stale connection
reuse. The async client is actually much more vulnerable to this issue than
we realized, and I'm trying to understand why.

First, let me state my understanding of how server-initiated connection
closure works. A FIN from the server wakes up the selector, which fires an
input event that is handled by the code in AbstractHttp1StreamDuplexer,
which "reads" the end-of-stream. This then initiates graceful shutdown of
the connection. I believe this works by the client closing its end of the
connection, which is an output event (the client has to send FIN, as well
as close_notify if TLS). Once this is handled, the duplexer transitions the
connection state to SHUTDOWN and closes the InternalDataChannel, which (for
a graceful shutdown) adds itself to the closedSessions queue. The IOReactor
processes this queue in processClosedSessions(), which will fail any
requests on the connection and remove it from the pool.
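
To illustrate just that first step in isolation (plain java.nio, nothing HttpCore-specific; the class name below is made up): a remote FIN wakes the selector with a read event, and the subsequent read() returns -1, signalling end-of-stream.

    import java.net.InetSocketAddress;
    import java.net.ServerSocket;
    import java.nio.ByteBuffer;
    import java.nio.channels.SelectionKey;
    import java.nio.channels.Selector;
    import java.nio.channels.SocketChannel;

    public class FinWakesSelector {
        public static void main(String[] args) throws Exception {
            try (ServerSocket server = new ServerSocket(0)) {
                SocketChannel client = SocketChannel.open(
                        new InetSocketAddress("localhost", server.getLocalPort()));
                client.configureBlocking(false);
                Selector selector = Selector.open();
                client.register(selector, SelectionKey.OP_READ);

                // Server side: accept and immediately close, i.e. send FIN.
                server.accept().close();

                // The FIN wakes the selector with a readable event...
                selector.select();
                ByteBuffer buf = ByteBuffer.allocate(64);
                int n = client.read(buf);
                // ...and the read observes end-of-stream, which is what the
                // duplexer's input event handling reacts to.
                System.out.println("read returned: " + n); // prints -1
            }
        }
    }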

The basic issue seems to be that since connection reuse does not go through
the IOReactor, there's a race condition between the IOReactor's event loop
(which drives all the processing and bookkeeping for remote connection
closure) and request execution (which can draw a doomed connection from the
connection pool). I wrote a test that sends requests to a server, which
sends a response and then immediately closes the connection (without any
`Connection: close` header). Part of the idea here is to simulate an async
client running in AWS Lambda, where the JVM is suspended between
invocations: if invocations are intermittent, connections will be remotely
closed while the JVM is suspended, and then during the next invocation the
IOReactor will race with the client request in just this way.
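
For reference, a rough sketch of the kind of test server I mean (illustrative only; the class name and port are arbitrary): it writes a complete HTTP/1.1 response without a Connection: close header and then closes the socket, so the client is entitled to keep the connection in the pool.

    import java.io.InputStream;
    import java.io.OutputStream;
    import java.net.ServerSocket;
    import java.net.Socket;
    import java.nio.charset.StandardCharsets;

    public class CloseAfterResponseServer {
        public static void main(String[] args) throws Exception {
            try (ServerSocket server = new ServerSocket(8080)) {
                while (true) {
                    try (Socket socket = server.accept()) {
                        InputStream in = socket.getInputStream();
                        // Read until the end of the request headers (naive; ignores request bodies).
                        int state = 0;
                        int b;
                        while (state < 4 && (b = in.read()) != -1) {
                            state = (b == '\r' || b == '\n') ? state + 1 : 0;
                        }
                        OutputStream out = socket.getOutputStream();
                        out.write(("HTTP/1.1 200 OK\r\n"
                                + "Content-Length: 2\r\n"
                                + "\r\n"
                                + "ok").getBytes(StandardCharsets.US_ASCII));
                        out.flush();
                    } // try-with-resources closes the socket right after the response, i.e. sends FIN
                }
            }
        }
    }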

What I found with this test setup is that sending requests one at a time
mostly works, but sometimes results in an exception (typically
RequestNotExecutedException). Setting a breakpoint on
PoolEntry::discardConnection, I can see that the client usually detects the
stale connection (the closure has been processed by the Duplexer):

     at org.apache.hc.core5.pool.PoolEntry.discardConnection(PoolEntry.java:170)
     at org.apache.hc.client5.http.impl.nio.PoolingAsyncClientConnectionManager$InternalConnectionEndpoint.isConnected(PoolingAsyncClientConnectionManager.java:739)
     at org.apache.hc.client5.http.impl.async.InternalHttpAsyncExecRuntime.isEndpointConnected(InternalHttpAsyncExecRuntime.java:208)
     at org.apache.hc.client5.http.impl.async.AsyncConnectExec$1.completed(AsyncConnectExec.java:157)
     at org.apache.hc.client5.http.impl.async.AsyncConnectExec$1.completed(AsyncConnectExec.java:153)
     at org.apache.hc.client5.http.impl.async.InternalHttpAsyncExecRuntime$1.completed(InternalHttpAsyncExecRuntime.java:128)
     at org.apache.hc.client5.http.impl.async.InternalHttpAsyncExecRuntime$1.completed(InternalHttpAsyncExecRuntime.java:120)
     at org.apache.hc.core5.concurrent.BasicFuture.completed(BasicFuture.java:148)
     at org.apache.hc.client5.http.impl.nio.PoolingAsyncClientConnectionManager$3$1.leaseCompleted(PoolingAsyncClientConnectionManager.java:336)


But sometimes, the connection closure is not processed in time, in which
case I see this stack trace, where the IOReactor fails the request and
discards the conn pool entry:

     at org.apache.hc.core5.pool.PoolEntry.discardConnection(PoolEntry.java:170)
     at org.apache.hc.client5.http.impl.nio.PoolingAsyncClientConnectionManager$InternalConnectionEndpoint.close(PoolingAsyncClientConnectionManager.java:724)
     at org.apache.hc.client5.http.impl.async.InternalHttpAsyncExecRuntime.discardEndpoint(InternalHttpAsyncExecRuntime.java:148)
     at org.apache.hc.client5.http.impl.async.InternalHttpAsyncExecRuntime.discardEndpoint(InternalHttpAsyncExecRuntime.java:180)
     at org.apache.hc.client5.http.impl.async.InternalAbstractHttpAsyncClient$2.failed(InternalAbstractHttpAsyncClient.java:363)
     at org.apache.hc.client5.http.impl.async.AsyncRedirectExec$1.failed(AsyncRedirectExec.java:261)
     at org.apache.hc.client5.http.impl.async.ContentCompressionAsyncExec$1.failed(ContentCompressionAsyncExec.java:160)
     at org.apache.hc.client5.http.impl.async.AsyncHttpRequestRetryExec$1.failed(AsyncHttpRequestRetryExec.java:203)
     at org.apache.hc.client5.http.impl.async.AsyncProtocolExec$1.failed(AsyncProtocolExec.java:297)
     at org.apache.hc.client5.http.impl.async.HttpAsyncMainClientExec$1.failed(HttpAsyncMainClientExec.java:135)
     at org.apache.hc.core5.http.nio.command.RequestExecutionCommand.failed(RequestExecutionCommand.java:101)
     at org.apache.hc.core5.http.nio.command.CommandSupport.cancelCommands(CommandSupport.java:68)
     at org.apache.hc.core5.http.impl.nio.AbstractHttp1StreamDuplexer.onDisconnect(AbstractHttp1StreamDuplexer.java:415)
     at org.apache.hc.core5.http.impl.nio.AbstractHttp1IOEventHandler.disconnected(AbstractHttp1IOEventHandler.java:95)
     at org.apache.hc.core5.http.impl.nio.ClientHttp1IOEventHandler.disconnected(ClientHttp1IOEventHandler.java:41)
     at org.apache.hc.core5.reactor.ssl.SSLIOSession$1.disconnected(SSLIOSession.java:253)
     at org.apache.hc.core5.reactor.InternalDataChannel.disconnected(InternalDataChannel.java:205)
     at org.apache.hc.core5.reactor.SingleCoreIOReactor.processClosedSessions(SingleCoreIOReactor.java:229)


Sending multiple requests in parallel (with a connection pool size of 1)
greatly exacerbates the race condition: _half_ of all requests fail. The
exceptions vary: RequestNotExecutedException, ConnectionClosedException
("Connection closed by peer"), and SocketException (connection reset) are
all possible, but what reliably happens is that every other request fails
due to stale (closed) connection reuse. I've reproduced this with HTTP,
TLSv1.2, and TLSv1.3.
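
For completeness, a rough sketch of the client side of this reproduction (assuming the illustrative test server above on localhost:8080; the class name is arbitrary): a pool capped at one connection, with a batch of requests submitted concurrently so that reuse of a just-closed connection races the IOReactor.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.Future;

    import org.apache.hc.client5.http.async.methods.SimpleHttpResponse;
    import org.apache.hc.client5.http.async.methods.SimpleRequestBuilder;
    import org.apache.hc.client5.http.impl.async.CloseableHttpAsyncClient;
    import org.apache.hc.client5.http.impl.async.HttpAsyncClients;
    import org.apache.hc.client5.http.impl.nio.PoolingAsyncClientConnectionManagerBuilder;

    public class StaleReuseRepro {
        public static void main(String[] args) throws Exception {
            try (CloseableHttpAsyncClient client = HttpAsyncClients.custom()
                    .setConnectionManager(PoolingAsyncClientConnectionManagerBuilder.create()
                            .setMaxConnTotal(1)
                            .setMaxConnPerRoute(1)
                            .build())
                    .build()) {
                client.start();
                List<Future<SimpleHttpResponse>> futures = new ArrayList<>();
                for (int i = 0; i < 20; i++) {
                    futures.add(client.execute(
                            SimpleRequestBuilder.get("http://localhost:8080/").build(), null));
                }
                for (Future<SimpleHttpResponse> future : futures) {
                    try {
                        System.out.println(future.get().getCode());
                    } catch (Exception ex) {
                        // Observed failure modes: RequestNotExecutedException,
                        // ConnectionClosedException, or a connection reset.
                        System.out.println("failed: " + ex);
                    }
                }
            }
        }
    }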

What I'd like to know is:

1. Can we do anything to improve this race condition?

I doubt it, but I will re-visit the code and see what can be done.

Is there currently
any blocking work or any async commands that need to run before the conn
pool entry can be discarded, or the endpoint marked as disconnected?

I do not think so. Connections can get evicted from the pool, and then whatever clean-up / post-close logic there is can run while those connections are no longer bound to the pool.

2. Can we do anything to detect stale connection reuse more reliably when
it does happen?

I doubt it, but I will look into it.

Are there cases where we could surface this as
a RequestNotExecutedException but currently don't?

Needs to be looked into.

3. Can the async client implement `validateAfterInactivity` for HTTP/1.1
requests?

Of course it can, but the only possibility I see here is doing a full-blown message exchange with a relatively cheap method (such as OPTIONS or TRACE).
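
Purely as an illustration of the idea (not something the client does today), the application-level equivalent of such a validation exchange, reusing the hypothetical client and test server from the sketches above, might look roughly like this:

    // Not current client behaviour; an application-level illustration only, reusing
    // the hypothetical `client` from the sketch above.
    SimpleHttpRequest probe = SimpleHttpRequest.create("OPTIONS", "http://localhost:8080/");
    try {
        client.execute(probe, null).get();
    } catch (Exception ex) {
        // A stale pooled connection is consumed (and discarded) by the throwaway
        // exchange instead of by the request that matters.
    }
    Future<SimpleHttpResponse> real = client.execute(
            SimpleRequestBuilder.get("http://localhost:8080/").build(), null);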

I think we could actually implement this with a synchronization
barrier, rather than IO operations: we validate inactive connections by
waiting for the next iteration of the IOReactor event loop. This would
allow any backlog of IO events to be processed, preventing the client from
racing with itself.
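
For concreteness, a purely conceptual sketch of what I mean, with a plain single-threaded executor standing in for the reactor event loop (none of these names are actual HttpCore API):

    import java.util.concurrent.CompletableFuture;
    import java.util.concurrent.ExecutorService;

    final class EventLoopBarrier {
        private final ExecutorService eventLoop; // stand-in for the I/O reactor thread

        EventLoopBarrier(ExecutorService eventLoop) {
            this.eventLoop = eventLoop;
        }

        // Completes once the event loop has run one more task, i.e. once any
        // already-pending I/O events (such as a remote close) have been processed.
        // A pooled connection would only be handed out for reuse after this completes.
        CompletableFuture<Void> awaitNextIteration() {
            return CompletableFuture.runAsync(() -> { }, eventLoop);
        }
    }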


I am not entirely convinced this is a good idea, but we can certainly give it a try.

Oleg


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@hc.apache.org
For additional commands, e-mail: dev-h...@hc.apache.org
