I believe that, before this change, IndexFetcher was creating its HttpSolrClient with the default timeout settings; I will confirm this. Those defaults are:
DEFAULT_CONNECT_TIMEOUT = 60000;
DEFAULT_SO_TIMEOUT = 600000;

More info here: solr/solrj/src/java/org/apache/solr/client/solrj/impl/HttpClientUtil.java

On Thu, Mar 20, 2025 at 5:56 AM David Smiley <dsmi...@apache.org> wrote:

> *Excellent* research Luke!
>
> Yes, please file a JIRA and propose a fix. I also have noticed timeouts
> named one thing and applied in a different way. Very suspicious.
>
> Hey Gus; if memory serves, did you discuss that once?
>
> On Wed, Mar 19, 2025 at 6:28 PM Luke Kot-Zaniewski (BLOOMBERG/ 919 3RD A)
> <lkotzanie...@bloomberg.net> wrote:
>
> > Greetings,
> >
> > During our rollout of 9.8 we discovered an interesting behavior
> > indirectly caused by the Http2SolrClient migration of IndexFetcher:
> >
> > https://github.com/apache/solr/commit/25194b02caa383feda293490eed6ccbd7c3ecf05#diff-7af383a173bd8e05414b341ab08e9ca715b665077112c64150c4db00811d16a6
> >
> > The change itself does not appear to be the problem, but rather the
> > default behavior of Http2SolrClient applying the *idle* timeout to the
> > overall request time:
> >
> > https://github.com/apache/solr/blob/2b8f933529fa736fe5fd2a9b0c751bedf352f0c7/solr/solrj/src/java/org/apache/solr/client/solrj/impl/Http2SolrClient.java#L625-L629
> >
> > Apparently this choice of default has some history:
> >
> > https://github.com/apache/solr/commit/a80eb84d5659a06a10860ad2470e87d80b19fa5d
> > and, in its current form:
> > https://github.com/apache/solr/commit/d70af456058174d15a25d3c9b8cc5f7a8899b62b
> >
> > At any rate, in most cases this goes unnoticed because the default idle
> > timeout is quite long (120 seconds), but it can cause problems when
> > applied to something like IndexFetcher, which is probably *expected* to
> > sometimes have really long-lived, healthy connections exceeding the
> > 120s period. An *idle* timeout being applied to a long-lived, non-idle
> > connection doesn't seem quite right...
> >
> > We saw this during replication of a 5GB segment which, at our bandwidth
> > at the time, exceeded the 120 second window and caused the Cloud to get
> > stuck in a replication loop.
> >
> > The stack trace tells the full story: in 120 seconds we were only able
> > to copy ~457MB (implying roughly 30 Mbps) before being interrupted by
> > this arguably misapplied idle timeout (on a clearly non-idle
> > connection!):
> >
> > 2025-03-17 20:47:15.144 WARN IndexFetcher [XXXX] ? [XXXX] - Error in fetching file: _7rxzv.cfs (downloaded 457179136 of 5529080446 bytes)
> > java.io.IOException: java.util.concurrent.TimeoutException: Total timeout 120000 ms elapsed
> >   at org.eclipse.jetty.client.util.InputStreamResponseListener$Input.toIOException(InputStreamResponseListener.java:343) ~[jetty-client-10.0.22.jar:10.0.22]
> >   at org.eclipse.jetty.client.util.InputStreamResponseListener$Input.read(InputStreamResponseListener.java:311) ~[jetty-client-10.0.22.jar:10.0.22]
> >   at java.base/java.io.FilterInputStream.read(FilterInputStream.java:133) ~[?:?]
> >   at org.apache.solr.common.util.FastInputStream.readWrappedStream(FastInputStream.java:87) ~[solr-solrj-9.8.0 d2673aab0d696a2f7330bf3267533525dfad1200 - cloud-user - 2025-01-29 16:21:49]
> >   at org.apache.solr.common.util.FastInputStream.read(FastInputStream.java:138) ~[solr-solrj-9.8.0 d2673aab0d696a2f7330bf3267533525dfad1200 - cloud-user - 2025-01-29 16:21:49]
> >   at org.apache.solr.common.util.FastInputStream.readFully(FastInputStream.java:176) ~[solr-solrj-9.8.0 d2673aab0d696a2f7330bf3267533525dfad1200 - cloud-user - 2025-01-29 16:21:49]
> >   at org.apache.solr.handler.IndexFetcher$FileFetcher.fetchPackets(IndexFetcher.java:1849) ~[solr-core-9.8.0 d2673aab0d696a2f7330bf3267533525dfad1200 - cloud-user - 2025-01-29 16:21:49]
> >   at org.apache.solr.handler.IndexFetcher$FileFetcher.fetch(IndexFetcher.java:1790) ~[solr-core-9.8.0 d2673aab0d696a2f7330bf3267533525dfad1200 - cloud-user - 2025-01-29 16:21:49]
> >   at org.apache.solr.handler.IndexFetcher$FileFetcher.fetchFile(IndexFetcher.java:1772) ~[solr-core-9.8.0 d2673aab0d696a2f7330bf3267533525dfad1200 - cloud-user - 2025-01-29 16:21:49]
> >   at org.apache.solr.handler.IndexFetcher.downloadIndexFiles(IndexFetcher.java:1192) ~[solr-core-9.8.0 d2673aab0d696a2f7330bf3267533525dfad1200 - cloud-user - 2025-01-29 16:21:49]
> >   at org.apache.solr.handler.IndexFetcher.fetchLatestIndex(IndexFetcher.java:679) ~[solr-core-9.8.0 d2673aab0d696a2f7330bf3267533525dfad1200 - cloud-user - 2025-01-29 16:21:49]
> >   at org.apache.solr.handler.IndexFetcher.fetchLatestIndex(IndexFetcher.java:410) ~[solr-core-9.8.0 d2673aab0d696a2f7330bf3267533525dfad1200 - cloud-user - 2025-01-29 16:21:49]
> >   at org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:522) ~[solr-core-9.8.0 d2673aab0d696a2f7330bf3267533525dfad1200 - cloud-user - 2025-01-29 16:21:49]
> >   at org.apache.solr.cloud.RecoveryStrategy.replicate(RecoveryStrategy.java:243) ~[solr-core-9.8.0 d2673aab0d696a2f7330bf3267533525dfad1200 - cloud-user - 2025-01-29 16:21:49]
> >   at org.apache.solr.cloud.RecoveryStrategy.doSyncOrReplicateRecovery(RecoveryStrategy.java:697) ~[solr-core-9.8.0 d2673aab0d696a2f7330bf3267533525dfad1200 - cloud-user - 2025-01-29 16:21:49]
> >   at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:333) ~[solr-core-9.8.0 d2673aab0d696a2f7330bf3267533525dfad1200 - cloud-user - 2025-01-29 16:21:49]
> >   at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:309) ~[solr-core-9.8.0 d2673aab0d696a2f7330bf3267533525dfad1200 - cloud-user - 2025-01-29 16:21:49]
> >   at com.codahale.metrics.InstrumentedExecutorService$InstrumentedRunnable.run(InstrumentedExecutorService.java:212) ~[metrics-core-4.2.26.jar:4.2.26]
> >   at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) ~[?:?]
> >   at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
> >   at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:380) ~[solr-solrj-9.8.0 d2673aab0d696a2f7330bf3267533525dfad1200 - cloud-user - 2025-01-29 16:21:49]
> >   at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
> >   at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
> >   at java.base/java.lang.Thread.run(Thread.java:829) [?:?]
> > Caused by: java.util.concurrent.TimeoutException: Total timeout 120000 ms elapsed
> >   at org.eclipse.jetty.client.HttpConnection$RequestTimeouts.onExpired(HttpConnection.java:342) ~[jetty-client-10.0.22.jar:10.0.22]
> >   at org.eclipse.jetty.client.HttpConnection$RequestTimeouts.onExpired(HttpConnection.java:322) ~[jetty-client-10.0.22.jar:10.0.22]
> >   at org.eclipse.jetty.io.CyclicTimeouts.onTimeoutExpired(CyclicTimeouts.java:110) ~[jetty-io-10.0.22.jar:10.0.22]
> >   at org.eclipse.jetty.io.CyclicTimeouts$Timeouts.onTimeoutExpired(CyclicTimeouts.java:197) ~[jetty-io-10.0.22.jar:10.0.22]
> >   at org.eclipse.jetty.io.CyclicTimeout$Wakeup.run(CyclicTimeout.java:294) ~[jetty-io-10.0.22.jar:10.0.22]
> >   at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) ~[?:?]
> >   at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
> >   at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) ~[?:?]
> >
> > If the community agrees this is a bug I'd be happy to raise a JIRA
> > ticket and attempt a fix, which I think should be simple.
> >
> > What do you all think?
> >
> > Thanks,
> > Luke
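P.S. For anyone following along, below is a minimal sketch against Jetty's own client API (the one Http2SolrClient wraps) of the two knobs in play. It is not the actual IndexFetcher/Http2SolrClient code, and the host/path in it are made up purely for illustration: Request.timeout() bounds the *total* request time and is what produces "Total timeout 120000 ms elapsed", while Request.idleTimeout() only fires when no bytes flow for the configured period.

import java.util.concurrent.TimeUnit;

import org.eclipse.jetty.client.HttpClient;
import org.eclipse.jetty.client.api.Request;

public class IdleVsTotalTimeoutSketch {
  public static void main(String[] args) throws Exception {
    HttpClient httpClient = new HttpClient();
    httpClient.start();
    try {
      long idleTimeoutMs = 120_000L; // Solr's default idle timeout of 120s

      // Hypothetical replication URL, only for illustration.
      Request request =
          httpClient.newRequest("http://replica:8983/solr/techproducts/replication");

      // What the thread describes Http2SolrClient doing: the *idle* timeout
      // value is used as Request.timeout(), which caps the TOTAL request
      // duration. A healthy 5GB download that needs more than 120s is then
      // aborted with "Total timeout 120000 ms elapsed".
      request.timeout(idleTimeoutMs, TimeUnit.MILLISECONDS);

      // What an idle timeout is usually expected to mean: abort only if the
      // connection makes no progress for 120s, however long the whole
      // transfer takes.
      request.idleTimeout(idleTimeoutMs, TimeUnit.MILLISECONDS);

      // request.send(...) omitted; this sketch only shows how the two
      // timeouts are configured, not a real replication fetch.
    } finally {
      httpClient.stop();
    }
  }
}

For comparison, the pre-9.8 path went through Apache HttpClient, where the DEFAULT_SO_TIMEOUT above (600000 ms) is, as far as I understand it, a socket read (inactivity) timeout rather than a cap on the whole request, which would explain why multi-minute segment downloads were not cut off before the migration.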