Great! I have added the ticket: https://issues.apache.org/jira/browse/SOLR-17711
Maybe it is best to continue the discussion there?

From: dev@solr.apache.org
At: 03/19/25 20:24:14 UTC-4:00
To: dev@solr.apache.org, gus.h...@gmail.com
Subject: Re: Bug in IndexFetcher + Http2SolrClient Interaction

*Excellent* research Luke! Yes, please file a JIRA and propose a fix. I
have also noticed timeouts named one thing and applied in a different way.
Very suspicious. Hey Gus, if memory serves, didn't you discuss that once?

On Wed, Mar 19, 2025 at 6:28 PM Luke Kot-Zaniewski (BLOOMBERG/ 919 3RD A)
<lkotzanie...@bloomberg.net> wrote:

> Greetings,
>
> During our rollout of 9.8 we discovered an interesting behavior
> indirectly caused by the Http2SolrClient migration of IndexFetcher:
>
> https://github.com/apache/solr/commit/25194b02caa383feda293490eed6ccbd7c3ecf05#diff-7af383a173bd8e05414b341ab08e9ca715b665077112c64150c4db00811d16a6
>
> The change itself does not appear to be the problem, but rather the
> default behavior of Http2SolrClient, which applies the *idle* timeout
> to the overall request time:
>
> https://github.com/apache/solr/blob/2b8f933529fa736fe5fd2a9b0c751bedf352f0c7/solr/solrj/src/java/org/apache/solr/client/solrj/impl/Http2SolrClient.java#L625-L629
>
> Apparently this choice of default has some history:
>
> https://github.com/apache/solr/commit/a80eb84d5659a06a10860ad2470e87d80b19fa5d
>
> and, in its current form:
>
> https://github.com/apache/solr/commit/d70af456058174d15a25d3c9b8cc5f7a8899b62b
>
> At any rate, in most cases this goes unnoticed because the default
> idle timeout is quite long (120 seconds), but it can cause problems
> when applied to something like IndexFetcher, which is probably
> *expected* to sometimes hold healthy, long-lived connections exceeding
> the 120s period. An *idle* timeout being applied to a long-lived,
> non-idle connection doesn't seem quite right...
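>
> For context, here is my reading of the two per-request timeout knobs
> Jetty exposes. This is an illustrative sketch of the API semantics,
> not the actual Solr code (the class and method names are made up):
>
>     import java.util.concurrent.TimeUnit;
>     import org.eclipse.jetty.client.HttpClient;
>     import org.eclipse.jetty.client.api.Request;
>
>     class TimeoutSemantics {
>         static Request configure(HttpClient client, String url) {
>             Request req = client.newRequest(url);
>             // Idle timeout: resets every time bytes move, so a healthy
>             // ten-minute download never trips it.
>             req.idleTimeout(120, TimeUnit.SECONDS);
>             // Total timeout: a hard deadline on the whole exchange; it
>             // fires at 120s even if bytes are still flowing. This is
>             // the knob that ends up being set from the *idle* timeout
>             // configuration.
>             req.timeout(120, TimeUnit.SECONDS);
>             return req;
>         }
>     }
>
> Only the second one can kill an actively-transferring download, which
> is exactly what we observed.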
>
> We saw this during replication of a ~5.5GB segment which, at our
> bandwidth at the time, exceeded the 120 second window and caused the
> Cloud to get stuck in a replication loop.
>
> The stacktrace tells the full story: in 120 seconds we were only able
> to copy about 0.45GB (an implied ~30 Mbps) before being interrupted by
> this arguably misapplied idle timeout (on a clearly non-idle
> connection!)
>
> 2025-03-17 20:47:15.144 WARN IndexFetcher [XXXX] ? [XXXX] - Error in fetching file: _7rxzv.cfs (downloaded 457179136 of 5529080446 bytes)
> java.io.IOException: java.util.concurrent.TimeoutException: Total timeout 120000 ms elapsed
>   at org.eclipse.jetty.client.util.InputStreamResponseListener$Input.toIOException(InputStreamResponseListener.java:343) ~[jetty-client-10.0.22.jar:10.0.22]
>   at org.eclipse.jetty.client.util.InputStreamResponseListener$Input.read(InputStreamResponseListener.java:311) ~[jetty-client-10.0.22.jar:10.0.22]
>   at java.base/java.io.FilterInputStream.read(FilterInputStream.java:133) ~[?:?]
>   at org.apache.solr.common.util.FastInputStream.readWrappedStream(FastInputStream.java:87) ~[solr-solrj-9.8.0 d2673aab0d696a2f7330bf3267533525dfad1200 - cloud-user - 2025-01-29 16:21:49]
>   at org.apache.solr.common.util.FastInputStream.read(FastInputStream.java:138) ~[solr-solrj-9.8.0 d2673aab0d696a2f7330bf3267533525dfad1200 - cloud-user - 2025-01-29 16:21:49]
>   at org.apache.solr.common.util.FastInputStream.readFully(FastInputStream.java:176) ~[solr-solrj-9.8.0 d2673aab0d696a2f7330bf3267533525dfad1200 - cloud-user - 2025-01-29 16:21:49]
>   at org.apache.solr.handler.IndexFetcher$FileFetcher.fetchPackets(IndexFetcher.java:1849) ~[solr-core-9.8.0 d2673aab0d696a2f7330bf3267533525dfad1200 - cloud-user - 2025-01-29 16:21:49]
>   at org.apache.solr.handler.IndexFetcher$FileFetcher.fetch(IndexFetcher.java:1790) ~[solr-core-9.8.0 d2673aab0d696a2f7330bf3267533525dfad1200 - cloud-user - 2025-01-29 16:21:49]
>   at org.apache.solr.handler.IndexFetcher$FileFetcher.fetchFile(IndexFetcher.java:1772) ~[solr-core-9.8.0 d2673aab0d696a2f7330bf3267533525dfad1200 - cloud-user - 2025-01-29 16:21:49]
>   at org.apache.solr.handler.IndexFetcher.downloadIndexFiles(IndexFetcher.java:1192) ~[solr-core-9.8.0 d2673aab0d696a2f7330bf3267533525dfad1200 - cloud-user - 2025-01-29 16:21:49]
>   at org.apache.solr.handler.IndexFetcher.fetchLatestIndex(IndexFetcher.java:679) ~[solr-core-9.8.0 d2673aab0d696a2f7330bf3267533525dfad1200 - cloud-user - 2025-01-29 16:21:49]
>   at org.apache.solr.handler.IndexFetcher.fetchLatestIndex(IndexFetcher.java:410) ~[solr-core-9.8.0 d2673aab0d696a2f7330bf3267533525dfad1200 - cloud-user - 2025-01-29 16:21:49]
>   at org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:522) ~[solr-core-9.8.0 d2673aab0d696a2f7330bf3267533525dfad1200 - cloud-user - 2025-01-29 16:21:49]
>   at org.apache.solr.cloud.RecoveryStrategy.replicate(RecoveryStrategy.java:243) ~[solr-core-9.8.0 d2673aab0d696a2f7330bf3267533525dfad1200 - cloud-user - 2025-01-29 16:21:49]
>   at org.apache.solr.cloud.RecoveryStrategy.doSyncOrReplicateRecovery(RecoveryStrategy.java:697) ~[solr-core-9.8.0 d2673aab0d696a2f7330bf3267533525dfad1200 - cloud-user - 2025-01-29 16:21:49]
>   at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:333) ~[solr-core-9.8.0 d2673aab0d696a2f7330bf3267533525dfad1200 - cloud-user - 2025-01-29 16:21:49]
>   at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:309) ~[solr-core-9.8.0 d2673aab0d696a2f7330bf3267533525dfad1200 - cloud-user - 2025-01-29 16:21:49]
>   at com.codahale.metrics.InstrumentedExecutorService$InstrumentedRunnable.run(InstrumentedExecutorService.java:212) ~[metrics-core-4.2.26.jar:4.2.26]
>   at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) ~[?:?]
>   at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
>   at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:380) ~[solr-solrj-9.8.0 d2673aab0d696a2f7330bf3267533525dfad1200 - cloud-user - 2025-01-29 16:21:49]
>   at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
>   at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
>   at java.base/java.lang.Thread.run(Thread.java:829) [?:?]
> Caused by: java.util.concurrent.TimeoutException: Total timeout 120000 ms elapsed
>   at org.eclipse.jetty.client.HttpConnection$RequestTimeouts.onExpired(HttpConnection.java:342) ~[jetty-client-10.0.22.jar:10.0.22]
>   at org.eclipse.jetty.client.HttpConnection$RequestTimeouts.onExpired(HttpConnection.java:322) ~[jetty-client-10.0.22.jar:10.0.22]
>   at org.eclipse.jetty.io.CyclicTimeouts.onTimeoutExpired(CyclicTimeouts.java:110) ~[jetty-io-10.0.22.jar:10.0.22]
>   at org.eclipse.jetty.io.CyclicTimeouts$Timeouts.onTimeoutExpired(CyclicTimeouts.java:197) ~[jetty-io-10.0.22.jar:10.0.22]
>   at org.eclipse.jetty.io.CyclicTimeout$Wakeup.run(CyclicTimeout.java:294) ~[jetty-io-10.0.22.jar:10.0.22]
>   at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) ~[?:?]
>   at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
>   at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) ~[?:?]
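>
> This should be straightforward to reproduce outside Solr with a bare
> Jetty client; something like the sketch below (untested; the URL is a
> placeholder for any endpoint that streams for longer than 120s):
>
>     import java.io.InputStream;
>     import java.util.concurrent.TimeUnit;
>     import org.eclipse.jetty.client.HttpClient;
>     import org.eclipse.jetty.client.api.Request;
>     import org.eclipse.jetty.client.api.Response;
>     import org.eclipse.jetty.client.util.InputStreamResponseListener;
>
>     public class SlowDownloadRepro {
>         public static void main(String[] args) throws Exception {
>             HttpClient client = new HttpClient();
>             client.start();
>             try {
>                 InputStreamResponseListener listener = new InputStreamResponseListener();
>                 // Total cap, mirroring the misapplied default.
>                 Request request = client.newRequest("http://localhost:8080/bigfile")
>                         .timeout(120, TimeUnit.SECONDS);
>                 request.send(listener);
>                 Response response = listener.get(10, TimeUnit.SECONDS); // wait for headers
>                 System.out.println("status: " + response.getStatus());
>                 long total = 0;
>                 try (InputStream in = listener.getInputStream()) {
>                     byte[] buf = new byte[8192];
>                     int n;
>                     // After 120s this read() throws an IOException caused by
>                     // TimeoutException("Total timeout 120000 ms elapsed"),
>                     // even while bytes are still arriving.
>                     while ((n = in.read(buf)) != -1) {
>                         total += n;
>                     }
>                 }
>                 System.out.println("downloaded " + total + " bytes");
>             } finally {
>                 client.stop();
>             }
>         }
>     }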
>
> If the community agrees this is a bug I'd be happy to raise a JIRA
> ticket and attempt a fix, which I think should be simple (a rough
> sketch follows below my sign-off).
>
> What do you all think?
>
> Thanks,
> Luke
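>
> P.S. A rough sketch of the direction I have in mind, hypothetical and
> untested (the helper name is made up; this is not the actual
> Http2SolrClient code):
>
>     import java.util.concurrent.TimeUnit;
>     import org.eclipse.jetty.client.api.Request;
>
>     class IdleTimeoutFix {
>         // Apply the configured idle timeout as Jetty's per-request
>         // *idle* timeout, which resets whenever traffic flows, instead
>         // of as the total request deadline, which fires regardless of
>         // activity.
>         static void applyIdleTimeout(Request req, long idleTimeoutMillis) {
>             req.idleTimeout(idleTimeoutMillis, TimeUnit.MILLISECONDS);
>             // Deliberately not calling req.timeout(...) here; callers
>             // that genuinely want a total deadline can set it explicitly.
>         }
>     }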