Great! I have added the ticket:

https://issues.apache.org/jira/browse/SOLR-17711

Maybe it is best to continue the discussion there?

From: dev@solr.apache.org At: 03/19/25 20:24:14 UTC-4:00To:  
dev@solr.apache.org,  gus.h...@gmail.com
Subject: Re: Bug in IndexFetcher + Http2SolrClient Interaction

*Excellent* research Luke!

Yes, please file a JIRA and propose a fix.  I also have noticed timeouts
named one thing and applied in a different way.  Very suspicious.

Hey Gus; if memory serves, did you discuss that once?

On Wed, Mar 19, 2025 at 6:28 PM Luke Kot-Zaniewski (BLOOMBERG/ 919 3RD A) <
lkotzanie...@bloomberg.net> wrote:

> Greetings,
>
> During our rollout of 9.8 we discovered an interesting
> behavior indirectly caused by the Http2SolrClient migration
> of IndexFetcher:
>
>
> 
https://github.com/apache/solr/commit/25194b02caa383feda293490eed6ccbd7c3ecf05#d
iff-7af383a173bd8e05414b341ab08e9ca715b665077112c64150c4db00811d16a6
>
> The change itself does not appear to be the problem, but
> rather the default behavior of Http2SolrClient applying
> the *idle* timeout to the overall request time:
>
>
> 
https://github.com/apache/solr/blob/2b8f933529fa736fe5fd2a9b0c751bedf352f0c7/sol
r/solrj/src/java/org/apache/solr/client/solrj/impl/Http2SolrClient.java#L625-L62
9
>
> Apparently this choice of default has some history:
>
>
> https://github.com/apache/solr/commit/a80eb84d5659a06a10860ad2470e87d80b19fa5d
> +
> in its current form:
>
> https://github.com/apache/solr/commit/d70af456058174d15a25d3c9b8cc5f7a8899b62b
>
> At any rate, in most cases this goes unnoticed because the
> default idle timeout is quite long (120 seconds) but can
> cause problems when applied to something like IndexFetcher
> which is probably *expected* to have sometimes really long-lived,
> healthy connections exceeding the 120s period. An *idle*
> timeout being applied to a long-lived, non-idle connection
> doesn't seem quite right...
>
> We saw this during replication of a 5GB segment which, at
> our bandwidth at the time, exceeded the 120 second time window
> and caused the Cloud to get stuck in a replication loop.
>
> The stacktrace tells the full story, namely in 120 seconds
> we were only able to copy 4.5GB (implied 300 Mbps) and were
> interrupted by this arguably misapplied idle timeout (on a
> clearly non-idle connection!)
>
> 2025-03-17 20:47:15.144 WARN  IndexFetcher [XXXX] ? [XXXX] - Error in
> fetching file: _7rxzv.cfs (downloaded 457179136 of 5529080446 bytes)
> java.io.IOException: java.util.concurrent.TimeoutException: Total timeout
> 120000 ms elapsed
>      at
> 
org.eclipse.jetty.client.util.InputStreamResponseListener$Input.toIOException(In
putStreamResponseListener.java:343)
> ~[jetty-client-10.0.22.jar:10.0.22]
>        at
> 
org.eclipse.jetty.client.util.InputStreamResponseListener$Input.read(InputStream
ResponseListener.java:311)
> ~[jetty-client-10.0.22.jar:10.0.22]
>         at 
java.base/java.io.FilterInputStream.read(FilterInputStream.java:133)
> ~[?:?]
>    at
> 
org.apache.solr.common.util.FastInputStream.readWrappedStream(FastInputStream.ja
va:87)
> ~[solr-solrj-9.8.0 d2673aab0d696a2f7330bf3267533525dfad1200 - cloud-user -
> 2025-01-29 16:21:49]
>         at
> org.apache.solr.common.util.FastInputStream.read(FastInputStream.java:138)
> ~[solr-solrj-9.8.0 d2673aab0d696a2f7330bf3267533525dfad1200 - cloud-user -
> 2025-01-29 16:21:49]
>     at
> 
org.apache.solr.common.util.FastInputStream.readFully(FastInputStream.java:176)
> ~[solr-solrj-9.8.0 d2673aab0d696a2f7330bf3267533525dfad1200 - cloud-user -
> 2025-01-29 16:21:49]
>        at
> 
org.apache.solr.handler.IndexFetcher$FileFetcher.fetchPackets(IndexFetcher.java:
1849)
> ~[solr-core-9.8.0 d2673aab0d696a2f7330bf3267533525dfad1200 - cloud-user -
> 2025-01-29 16:21:49]
>   at
> org.apache.solr.handler.IndexFetcher$FileFetcher.fetch(IndexFetcher.java:1790)
> ~[solr-core-9.8.0 d2673aab0d696a2f7330bf3267533525dfad1200 - cloud-user -
> 2025-01-29 16:21:49]
>  at
> 
org.apache.solr.handler.IndexFetcher$FileFetcher.fetchFile(IndexFetcher.java:177
2)
> ~[solr-core-9.8.0 d2673aab0d696a2f7330bf3267533525dfad1200 - cloud-user -
> 2025-01-29 16:21:49]
>      at
> 
org.apache.solr.handler.IndexFetcher.downloadIndexFiles(IndexFetcher.java:1192)
> ~[solr-core-9.8.0 d2673aab0d696a2f7330bf3267533525dfad1200 - cloud-user -
> 2025-01-29 16:21:49]
>         at
> org.apache.solr.handler.IndexFetcher.fetchLatestIndex(IndexFetcher.java:679)
> ~[solr-core-9.8.0 d2673aab0d696a2f7330bf3267533525dfad1200 - cloud-user -
> 2025-01-29 16:21:49]
>    at
> org.apache.solr.handler.IndexFetcher.fetchLatestIndex(IndexFetcher.java:410)
> ~[solr-core-9.8.0 d2673aab0d696a2f7330bf3267533525dfad1200 - cloud-user -
> 2025-01-29 16:21:49]
>    at
> 
org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:522)
> ~[solr-core-9.8.0 d2673aab0d696a2f7330bf3267533525dfad1200 - cloud-user -
> 2025-01-29 16:21:49]
>         at
> org.apache.solr.cloud.RecoveryStrategy.replicate(RecoveryStrategy.java:243)
> ~[solr-core-9.8.0 d2673aab0d696a2f7330bf3267533525dfad1200 - cloud-user -
> 2025-01-29 16:21:49]
>     at
> 
org.apache.solr.cloud.RecoveryStrategy.doSyncOrReplicateRecovery(RecoveryStrateg
y.java:697)
> ~[solr-core-9.8.0 d2673aab0d696a2f7330bf3267533525dfad1200 - cloud-user -
> 2025-01-29 16:21:49]
>     at
> org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:333)
> ~[solr-core-9.8.0 d2673aab0d696a2f7330bf3267533525dfad1200 - cloud-user -
> 2025-01-29 16:21:49]
>    at
> org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:309)
> ~[solr-core-9.8.0 d2673aab0d696a2f7330bf3267533525dfad1200 - cloud-user -
> 2025-01-29 16:21:49]
>   at
> 
com.codahale.metrics.InstrumentedExecutorService$InstrumentedRunnable.run(Instru
mentedExecutorService.java:212)
> ~[metrics-core-4.2.26.jar:4.2.26]
>      at
> 
java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515
)
> ~[?:?]
>       at
> java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
>      at
> 
org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execu
te$0(ExecutorUtil.java:380)
> ~[solr-solrj-9.8.0 d2673aab0d696a2f7330bf3267533525dfad1200 - cloud-user -
> 2025-01-29 16:21:49]
>    at
> 
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.j
ava:1128)
> ~[?:?]
>       at
> 
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.
java:628)
> ~[?:?]
>       at java.base/java.lang.Thread.run(Thread.java:829) [?:?]
> Caused by: java.util.concurrent.TimeoutException: Total timeout 120000 ms
> elapsed
>  at
> 
org.eclipse.jetty.client.HttpConnection$RequestTimeouts.onExpired(HttpConnection
.java:342)
> ~[jetty-client-10.0.22.jar:10.0.22]
>         at
> 
org.eclipse.jetty.client.HttpConnection$RequestTimeouts.onExpired(HttpConnection
.java:322)
> ~[jetty-client-10.0.22.jar:10.0.22]
>         at 
org.eclipse.jetty.io.CyclicTimeouts.onTimeoutExpired(CyclicTimeouts.java:110)
> ~[jetty-io-10.0.22.jar:10.0.22]
>  at 
org.eclipse.jetty.io.CyclicTimeouts$Timeouts.onTimeoutExpired(CyclicTimeouts.jav
a:197)
> ~[jetty-io-10.0.22.jar:10.0.22]
>         at 
org.eclipse.jetty.io.CyclicTimeout$Wakeup.run(CyclicTimeout.java:294)
> ~[jetty-io-10.0.22.jar:10.0.22]
>  at
> 
java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515
)
> ~[?:?]
>       at
> java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
>      at
> 
java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.r
un(ScheduledThreadPoolExecutor.java:304)
> ~[?:?]
>
> If the community agrees this is a bug I'd be happy to raise a JIRA
> ticket and attempt a fix, which I think should be simple.
>
> What do you all think?
>
> Thanks,
> Luke


Reply via email to