Re: Stale connection reuse in the async client

Ryan Schmitt Wed, 06 Aug 2025 11:34:22 -0700

Incidentally, I just spotted this commit:

https://github.com/apache/httpcomponents-client/commit/99d4a5e081f31616a2558e72824ed8cf52198596


I didn't even know that Java had these socket options. This will be a big
help, especially for Lambda: since TCP keep-alive is implemented by the
kernel, it will continue to operate even while the JVM is suspended. This
is why the public documentation advises the use of keepalive:

https://docs.aws.amazon.com/lambda/latest/dg/best-practices.html

On Tue, Aug 5, 2025 at 7:29 PM Ryan Schmitt <rschm...@apache.org> wrote:

> Since rolling out 5.5 (which includes my change for
> `validateAfterInactivity`, which was previously interpreted as a connection
> TTL for async HTTP 1.1 connections), I've been getting reports of
> intermittent request failures which appear to be caused by stale connection
> reuse. The async client is actually much more vulnerable to this issue than
> we realized, and I'm trying to understand why.
>
> First, let me state my understanding of how server-initiated connection
> closure works. A FIN from the server wakes up the selector, which fires an
> input event that is handled by the code in AbstractHttp1StreamDuplexer,
> which "reads" the end-of-stream. This then initiates graceful shutdown of
> the connection. I believe this works by the client closing its end of the
> connection, which is an output event (the client has to send FIN, as well
> as close_notify if TLS). Once this is handled, the duplexer transitions the
> connection state to SHUTDOWN and closes the InternalDataChannel, which (for
> a graceful shutdown) adds itself to the closedSessions queue. The IOReactor
> processes this queue in processClosedSessions(), which will fail any
> requests on the connection and remove it from the pool.
>
> The basic issue seems to be that since connection reuse does not go
> through the IOReactor, there's a race condition between the IOReactor's
> event loop (which drives all the processing and bookkeeping for remote
> connection closure) and request execution (which can draw a doomed
> connection from the connection pool). I wrote a test that sends requests to
> a server, which sends a response and then immediately closes the connection
> (without any `Connection: close` header). Part of the idea here is to
> simulate an async client running in AWS Lambda, where the JVM is suspended
> between invocations: if invocations are intermittent, connections will be
> remotely closed while the JVM is suspended, and then during the next
> invocation the IOReactor will race with the client request in just this way.
>
> What I found with this test setup is that sending requests one at a time
> mostly works, but sometimes results in an exception (typically
> RequestNotExecutedException). Setting a breakpoint on
> PoolEntry::discardConnection, I can see that the client usually detects the
> stale connection (the closure has been processed by the Duplexer):
>
>     at
>> org.apache.hc.core5.pool.PoolEntry.discardConnection(PoolEntry.java:170)
>>     at
>> org.apache.hc.client5.http.impl.nio.PoolingAsyncClientConnectionManager$InternalConnectionEndpoint.isConnected(PoolingAsyncClientConnectionManager.java:739)
>>     at
>> org.apache.hc.client5.http.impl.async.InternalHttpAsyncExecRuntime.isEndpointConnected(InternalHttpAsyncExecRuntime.java:208)
>>     at
>> org.apache.hc.client5.http.impl.async.AsyncConnectExec$1.completed(AsyncConnectExec.java:157)
>>     at
>> org.apache.hc.client5.http.impl.async.AsyncConnectExec$1.completed(AsyncConnectExec.java:153)
>>     at
>> org.apache.hc.client5.http.impl.async.InternalHttpAsyncExecRuntime$1.completed(InternalHttpAsyncExecRuntime.java:128)
>>     at
>> org.apache.hc.client5.http.impl.async.InternalHttpAsyncExecRuntime$1.completed(InternalHttpAsyncExecRuntime.java:120)
>>     at
>> org.apache.hc.core5.concurrent.BasicFuture.completed(BasicFuture.java:148)
>>     at
>> org.apache.hc.client5.http.impl.nio.PoolingAsyncClientConnectionManager$3$1.leaseCompleted(PoolingAsyncClientConnectionManager.java:336)
>
>
> But sometimes, the connection closure is not processed in time, in which
> case I see this stack trace, where the IOReactor fails the request and
> discards the conn pool entry:
>
>     at
>> org.apache.hc.core5.pool.PoolEntry.discardConnection(PoolEntry.java:170)
>>     at
>> org.apache.hc.client5.http.impl.nio.PoolingAsyncClientConnectionManager$InternalConnectionEndpoint.close(PoolingAsyncClientConnectionManager.java:724)
>>     at
>> org.apache.hc.client5.http.impl.async.InternalHttpAsyncExecRuntime.discardEndpoint(InternalHttpAsyncExecRuntime.java:148)
>>     at
>> org.apache.hc.client5.http.impl.async.InternalHttpAsyncExecRuntime.discardEndpoint(InternalHttpAsyncExecRuntime.java:180)
>>     at
>> org.apache.hc.client5.http.impl.async.InternalAbstractHttpAsyncClient$2.failed(InternalAbstractHttpAsyncClient.java:363)
>>     at
>> org.apache.hc.client5.http.impl.async.AsyncRedirectExec$1.failed(AsyncRedirectExec.java:261)
>>     at
>> org.apache.hc.client5.http.impl.async.ContentCompressionAsyncExec$1.failed(ContentCompressionAsyncExec.java:160)
>>     at
>> org.apache.hc.client5.http.impl.async.AsyncHttpRequestRetryExec$1.failed(AsyncHttpRequestRetryExec.java:203)
>>     at
>> org.apache.hc.client5.http.impl.async.AsyncProtocolExec$1.failed(AsyncProtocolExec.java:297)
>>     at
>> org.apache.hc.client5.http.impl.async.HttpAsyncMainClientExec$1.failed(HttpAsyncMainClientExec.java:135)
>>     at
>> org.apache.hc.core5.http.nio.command.RequestExecutionCommand.failed(RequestExecutionCommand.java:101)
>>     at
>> org.apache.hc.core5.http.nio.command.CommandSupport.cancelCommands(CommandSupport.java:68)
>>     at
>> org.apache.hc.core5.http.impl.nio.AbstractHttp1StreamDuplexer.onDisconnect(AbstractHttp1StreamDuplexer.java:415)
>>     at
>> org.apache.hc.core5.http.impl.nio.AbstractHttp1IOEventHandler.disconnected(AbstractHttp1IOEventHandler.java:95)
>>     at
>> org.apache.hc.core5.http.impl.nio.ClientHttp1IOEventHandler.disconnected(ClientHttp1IOEventHandler.java:41)
>>     at
>> org.apache.hc.core5.reactor.ssl.SSLIOSession$1.disconnected(SSLIOSession.java:253)
>>     at
>> org.apache.hc.core5.reactor.InternalDataChannel.disconnected(InternalDataChannel.java:205)
>>     at
>> org.apache.hc.core5.reactor.SingleCoreIOReactor.processClosedSessions(SingleCoreIOReactor.java:229)
>
>
> Sending multiple requests in parallel (with a connection pool size of 1)
> greatly exacerbates the race condition: _half_ of all requests fail. The
> exceptions vary: RequestNotExecutedException, ConnectionClosedException
> ("Connection closed by peer"), and SocketException (connection reset) are
> all possible, but what reliably happens is that every other request fails
> due to stale (closed) connection reuse. I've reproduced this with HTTP,
> TLSv1.2, and TLSv1.3.
>
> What I'd like to know is:
>
> 1. Can we do anything to improve this race condition? Is there currently
> any blocking work or any async commands that need to run before the conn
> pool entry can be discarded, or the endpoint marked as disconnected?
> 2. Can we do anything to detect stale connection reuse more reliably when
> it does happen? Are there cases where we can surface this as
> a RequestNotExecutedException where we currently aren't?
> 3. Can the async client implement `validateAfterInactivity` for HTTP/1.1
> requests? I think we could actually implement this with a synchronization
> barrier, rather than IO operations: we validate inactive connections by
> waiting for the next iteration of the IOReactor event loop. This would
> allow any backlog of IO events to be processed, preventing the client from
> racing with itself.
>

Re: Stale connection reuse in the async client

Reply via email to