[jira] [Updated] (SOLR-17711) IndexFetcher Incorrectly Timing Out

Luke Kot-Zaniewski (Jira) Fri, 16 May 2025 17:31:32 -0700


     [ https://issues.apache.org/jira/browse/SOLR-17711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Luke Kot-Zaniewski updated SOLR-17711:
--------------------------------------
    Description: 
During our rollout of 9.8 we discovered an interesting
behavior indirectly caused by the Http2SolrClient migration
of IndexFetcher:
 
https://github.com/apache/solr/commit/e62cb1b0066132347589b5e5ca38f38fd5e668d0
 
The change itself does not appear to be the problem; rather,
the problem is Http2SolrClient's default behavior of applying
the *idle* timeout to the overall request time:
 
[https://github.com/apache/solr/blob/2b8f933529fa736fe5fd2a9b0c751bedf352f0c7/solr/solrj/src/java/org/apache/solr/client/solrj/impl/Http2SolrClient.java#L625-L629]
 
Apparently this choice of default has some history:
 
[https://github.com/apache/solr/commit/a80eb84d5659a06a10860ad2470e87d80b19fa5d]
and, in its current form:
[https://github.com/apache/solr/commit/d70af456058174d15a25d3c9b8cc5f7a8899b62b]
 
At any rate, in most cases this goes unnoticed because the
default idle timeout is quite long (120 seconds), but it can
cause problems when applied to something like IndexFetcher,
which is probably *expected* to sometimes have long-lived,
healthy connections that exceed the 120-second period. Applying
an *idle* timeout to a long-lived, non-idle connection
doesn't seem quite right.
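
To make the distinction concrete, here is a minimal, self-contained sketch
of the two timeout semantics. It is a hypothetical model, not Solr or Jetty
code: the class name TimeoutSketch, both method names, and the 10-second
chunk interval are assumptions made purely for illustration.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model (not Solr/Jetty code) contrasting idle and total request timeouts. */
public class TimeoutSketch {

    /**
     * Idle semantics: fires only if the gap between two consecutive chunk
     * arrivals (or between request start and the first chunk) exceeds the limit.
     * Arrival times are offsets from the start of the request.
     */
    public static boolean idleTimeoutFires(List<Duration> chunkArrivals, Duration idleTimeout) {
        Duration previous = Duration.ZERO; // request start
        for (Duration arrival : chunkArrivals) {
            if (arrival.minus(previous).compareTo(idleTimeout) > 0) {
                return true;
            }
            previous = arrival;
        }
        return false;
    }

    /**
     * Total-request semantics: fires if the final chunk arrives later than
     * the limit, no matter how steadily data was flowing before that.
     */
    public static boolean requestTimeoutFires(List<Duration> chunkArrivals, Duration requestTimeout) {
        if (chunkArrivals.isEmpty()) {
            return false;
        }
        Duration last = chunkArrivals.get(chunkArrivals.size() - 1);
        return last.compareTo(requestTimeout) > 0;
    }

    public static void main(String[] args) {
        Duration limit = Duration.ofSeconds(120);

        // A healthy transfer: one chunk every 10 seconds for 200 seconds in total.
        List<Duration> arrivals = new ArrayList<>();
        for (int seconds = 10; seconds <= 200; seconds += 10) {
            arrivals.add(Duration.ofSeconds(seconds));
        }

        // prints "idle timeout fires:    false"
        System.out.println("idle timeout fires:    " + idleTimeoutFires(arrivals, limit));
        // prints "request timeout fires: true"
        System.out.println("request timeout fires: " + requestTimeoutFires(arrivals, limit));
    }
}
```

With the same 120-second limit, the transfer is never idle for more than
10 seconds, yet it is cut off once the limit is treated as a cap on the
whole request. That is the behavior described above.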
 
We saw this during replication of a 5GB segment: at our
bandwidth at the time, the transfer took longer than the 120-second
window, and the Cloud got stuck in a replication loop.
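
For a rough sense of scale, the sketch below computes the sustained
throughput a transfer would need to finish inside the window. The class
and method names are hypothetical, and reading "5GB" as 5 GiB is an
assumption; our actual bandwidth is not part of the calculation.

```java
import java.util.Locale;

/** Hypothetical helper: throughput needed to finish a transfer inside a time window. */
public class ReplicationWindow {

    /** Minimum sustained rate, in bytes per second, to move sizeBytes within windowSeconds. */
    public static double minBytesPerSecond(long sizeBytes, long windowSeconds) {
        return (double) sizeBytes / windowSeconds;
    }

    public static void main(String[] args) {
        long segmentBytes = 5L * 1024 * 1024 * 1024; // assumption: 5GB means 5 GiB
        long windowSeconds = 120;                    // default timeout from the description

        double bytesPerSecond = minBytesPerSecond(segmentBytes, windowSeconds);

        // prints "42.7 MiB/s (358 Mbit/s)"
        System.out.printf(Locale.ROOT, "%.1f MiB/s (%.0f Mbit/s)%n",
                bytesPerSecond / (1024.0 * 1024.0),
                bytesPerSecond * 8 / 1_000_000.0);
    }
}
```

Any link that sustains less than this for a segment of that size would
cross the 120-second window even though data never stops flowing.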

  was:
During our rollout of 9.8 we discovered an interesting
behavior indirectly caused by the Http2SolrClient migration
of IndexFetcher:
 
[https://github.com/apache/solr/commit/25194b02caa383feda293490eed6ccbd7c3ecf05#diff-7af383a173bd8e05414b341ab08e9ca715b665077112c64150c4db00811d16a6]
 
The change itself does not appear to be the problem; rather,
the problem is Http2SolrClient's default behavior of applying
the *idle* timeout to the overall request time:
 
[https://github.com/apache/solr/blob/2b8f933529fa736fe5fd2a9b0c751bedf352f0c7/solr/solrj/src/java/org/apache/solr/client/solrj/impl/Http2SolrClient.java#L625-L629]
 
Apparently this choice of default has some history:
 
[https://github.com/apache/solr/commit/a80eb84d5659a06a10860ad2470e87d80b19fa5d]
and, in its current form:
[https://github.com/apache/solr/commit/d70af456058174d15a25d3c9b8cc5f7a8899b62b]
 
At any rate, in most cases this goes unnoticed because the
default idle timeout is quite long (120 seconds), but it can
cause problems when applied to something like IndexFetcher,
which is probably *expected* to sometimes have long-lived,
healthy connections that exceed the 120-second period. Applying
an *idle* timeout to a long-lived, non-idle connection
doesn't seem quite right.
 
We saw this during replication of a 5GB segment: at our
bandwidth at the time, the transfer took longer than the 120-second
window, and the Cloud got stuck in a replication loop.


> IndexFetcher Incorrectly Timing Out
> -----------------------------------
>
>                 Key: SOLR-17711
>                 URL: https://issues.apache.org/jira/browse/SOLR-17711
>             Project: Solr
>          Issue Type: Bug
>    Affects Versions: 9.8
>            Reporter: Luke Kot-Zaniewski
>            Priority: Major



