[
https://issues.apache.org/jira/browse/SOLR-18188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18091066#comment-18091066
]
David Smiley commented on SOLR-18188:
-------------------------------------
I observed something similar when you posted. I gave your message to Claude
Opus 4.6, which gave the following analysis:
----
The root cause is that ClosedChannelException is not retryable for UPDATE
requests in LBSolrClient.
Here's the chain:
1. Both tests close a proxy for a non-leader replica, then send updates via
cloudClient without retry logic. RecoveryAfterSoftCommitTest calls
cloudClient.add() directly (line 106); ReplicationFactorTest calls
sendDoc() → cloudClient.request() (line 495). Neither has retry-on-failure.
2. RandomizingCloudSolrClientBuilder randomizes shardLeadersOnly. When
randomized to false, the CloudSolrClient falls through to LBSolrClient for
routing, which can pick the partitioned replica's endpoint.
3. LBSolrClient treats UPDATEs as non-retryable (LBSolrClient.java:575). For
non-retryable requests, it only retries on "connect exceptions":
ConnectException or HttpConnectTimeoutException (isConnectException() at
line 686). ClosedChannelException doesn't qualify, so the request is thrown,
not retried on the next server (line 677).
4. The old Apache HttpClient (HTTP/1.1) masked this. When trying to reach a
closed proxy, HTTP/1.1 created a new TCP connection → got ConnectException
(retryable). HTTP/2 reuses a persistent multiplexed session → the dead
session throws ClosedChannelException (not retryable). Same randomization,
different exception type, different retry behavior.
Why these two tests and not others that also close proxies:
- ForceLeaderTest.sendDoc() wraps in sendDocsWithRetry(), which catches all
exceptions
- HttpPartitionTest sends docs to specific Jettys directly, not through
cloudClient
- Tests that close ALL proxies don't care which replica the LB picks, since
they all fail equally
Possible fix directions:
- Add ClosedChannelException (or broader IOException on dead sessions) to
isConnectException() in LBSolrClient. This is the proper product-level fix,
since a dead session is semantically equivalent to a connect failure
- Or add retry logic in these two tests (test-level workaround)
The JettySolrRunner proxy ordering change from the June 5 commit (8c49326e71b)
is a red herring: it only affects JettySolrRunner.stop(), not test-level
proxy.close() calls.
----
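To make step 3 of the analysis concrete, here is a minimal sketch of the
retry decision as the analysis describes it. This is not the actual
LBSolrClient code: the class and method names (RetryDecisionSketch,
isConnectExceptionSketch, shouldTryNextServerForUpdate) are hypothetical, and
it only models the non-retryable (UPDATE) case, checking the exception itself
rather than any cause chain.

```java
import java.net.ConnectException;
import java.net.http.HttpConnectTimeoutException;
import java.nio.channels.ClosedChannelException;

class RetryDecisionSketch {

  // Hypothetical stand-in for the check attributed to isConnectException():
  // only failures that happen while opening a connection qualify.
  static boolean isConnectExceptionSketch(Throwable t) {
    return t instanceof ConnectException || t instanceof HttpConnectTimeoutException;
  }

  // For a non-retryable request such as an UPDATE, the analysis says the
  // next server is tried only when the failure is a connect exception.
  static boolean shouldTryNextServerForUpdate(Throwable t) {
    return isConnectExceptionSketch(t);
  }

  public static void main(String[] args) {
    // HTTP/1.1 path per the analysis: a new TCP connection to the closed proxy fails.
    System.out.println(
        shouldTryNextServerForUpdate(new ConnectException("Connection refused"))); // true
    // HTTP/2 path per the analysis: the reused session is already dead.
    System.out.println(
        shouldTryNextServerForUpdate(new ClosedChannelException())); // false
  }
}
```

The two calls in main mirror step 4: the same closed proxy yields a different
exception type, and only one of them passes the check.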
I'm very skeptical that ClosedChannelException occurs only while a connection
is being opened, which is the case
{{org.apache.solr.client.solrj.impl.LBSolrClient#isConnectException}} is
supposed to be limited to. I'm inclined to simply set {{solr.http1=true}} for
this test, perhaps limited to the specific CloudSolrClient used, if possible.
WDYT?
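A rough sketch of what scoping the property could look like, assuming the
client reads {{solr.http1}} as a JVM system property at construction time
(not confirmed here). The helper name ScopedSystemProperty.withProperty is
hypothetical; in a real test the supplier would build the CloudSolrClient
instead of just reading the property back.

```java
import java.util.function.Supplier;

class ScopedSystemProperty {

  // Sets a system property only while body runs, then restores the prior state.
  static <T> T withProperty(String key, String value, Supplier<T> body) {
    String previous = System.getProperty(key);
    System.setProperty(key, value);
    try {
      return body.get();
    } finally {
      if (previous == null) {
        System.clearProperty(key);
      } else {
        System.setProperty(key, previous);
      }
    }
  }

  public static void main(String[] args) {
    // Placeholder body: a real test would construct its client here (assumption).
    String seenInside =
        withProperty("solr.http1", "true", () -> System.getProperty("solr.http1"));
    System.out.println("inside: " + seenInside);
    System.out.println("after: " + System.getProperty("solr.http1"));
  }
}
```

Restoring the previous value in the finally block keeps the setting from
leaking into other tests that run in the same JVM.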
> solr-test-framework: Remove Apache HttpClient usages
> ----------------------------------------------------
>
> Key: SOLR-18188
> URL: https://issues.apache.org/jira/browse/SOLR-18188
> Project: Solr
> Issue Type: Task
> Components: test-framework, Tests
> Reporter: David Smiley
> Assignee: David Smiley
> Priority: Major
> Labels: pull-request-available
> Fix For: 10.1
>
> Time Spent: 6h 50m
> Remaining Estimate: 0h
>
> As of this writing, the last usages of Apache HttpClient are in Solr's tests.
> This issue aims to remove it completely. But it's a lot of work.
> Some possible steps:
> * Remove tests for our HttpSolrClient & friends (fundamentally based on
> Apache HttpClient)
> * Replace usages of HttpSolrClient.getHttpClient with
> HttpJettySolrClient.getHttpClient
> * Replace usages of HttpSolrClient.getBaseURL by introducing a new base
> client that has this method. Or access similarly from Jetty when easily
> available.
> * of course, stop using HttpSolrClient & friends. Maybe class-by-class.