[ 
https://issues.apache.org/jira/browse/SOLR-13389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16814714#comment-16814714
 ] 

Hoss Man commented on SOLR-13389:
---------------------------------

First off, some notes on what I found when I went digging...
 * {{SolrClientBuilder}} has a hardcoded default value of {{protected Integer 
socketTimeoutMillis = 120000}} ... aka: *2 minutes*
 ** this serves as the default value for all of the SolrClient impls with 
Builders that subclass it: HttpSolrClient, ConcurrentUpdateSolrClient, 
LBHttpSolrClient, CloudSolrClient.
 ** *BUT* approximately half of the places in solr/core that use a 
SolrClient.Builder override this default via {{withSocketTimeout(...)}}
 *** in some cases hardcoded values are used that vary from as little as *1 
second* (RecoveryStrategy) to as much as *2 minutes* 
(OverseerCollectionMessageHandler, SyncStrategy & IndexFetcher)

 * {{HttpClientUtil}} defines a {{public static final int DEFAULT_SO_TIMEOUT = 
600000}} ... aka: *10 minutes*
 ** this constant is used in a few {{HttpClientUtil}} helper methods:
 *** hardcoded value when using {{HttpClientUtil.createDefaultRequestConfigBuilder()}}
 *** default when using {{HttpClientUtil.createClient(SolrParams...)}}
 ** this constant is also used as the default value for {{Http2SolrClient}}
 ** This effective value (600000ms) is also specified as the "soft coded" 
default in the solr.xml for both intranode related options:
 *** {{<shardHandlerFactory>}}: {{<int 
name="socketTimeout">${socketTimeout:600000}</int>}}
 *** {{<solrcloud>}}: {{<int 
name="distribUpdateSoTimeout">${distribUpdateSoTimeout:600000}</int>}}
 *** this constant is also used as the "hard coded" default in the 
corresponding java code if solr.xml doesn't contain those optional settings

 * In our _test_ solr.xml files, the "soft coded" default 
socketTimeout/distribUpdateSoTimeout values vary significantly from the version 
of solr.xml we give to users...
 ** {{test-files/solr/solr.xml}}
 *** {{socketTimeout}} = *15 seconds*
 *** {{distribUpdateSoTimeout}} = *30 seconds*
 *** NOTE: this is the file used by most "legacy" distributed/zk tests
 ** {{test-files/solr/solr-solrreporter.xml}} & {{.../solr-jmxreporter.xml}} & 
{{.../solr-trackingshardhandler.xml}} & {{MiniSolrCloudCluster}} 's inline 
{{DEFAULT_CLOUD_SOLR_XML}}
 *** {{socketTimeout}} = *90 seconds*
 *** {{distribUpdateSoTimeout}} = *340 seconds* (5 minutes 40 seconds ?!?!)
 ** {{test-files/solr/solr-50-all.xml}} ... only used for config parsing 
tests...
 *** {{socketTimeout}} = *100 ms*
 *** {{distribUpdateSoTimeout}} = *33 ms*
 ** {{test-files/solr/solr-stress-new.xml}} ... only used for some Zk file 
management tests??? ...
 *** {{socketTimeout}} = *90 seconds*
 *** {{distribUpdateSoTimeout}} ... left to default
 ** In ~10 test (base) classes these "soft coded" defaults are overridden via 
{{System.setProperty(....)}} using values varying from as little as *3 seconds* 
(PeerSyncReplicationTest) to a max of *90 seconds* (HdfsTestUtil)
 *** I suspect – but have not dug deep into the archives to verify – that in 
most cases these were probably ad hoc attempts to overcome timeout related 
failures seen in the wild, but since then the "global defaults" have been 
changed to be much higher (in some cases higher than what the individual tests 
hardcode)

 * In our test _code_, tests that use {{SolrClients}} to communicate with solr 
use a variety of socket timeouts...
 ** a handful of tests construct the SolrClient (Builders) themselves, and use 
values ranging from *30 seconds* to *90 seconds*
 ** tests using {{MiniSolrCloudCluster.getSolrClient()}} get a hardcoded *90 
seconds*
 ** tests using the various 
{{SolrTestCaseJ4.get(Http|LBHttp|Cloud|ConcurrentUpdate|CloudHttp2)SolrClient(...)}}
 methods get the default from the corresponding Builder ... unless they 
override it...
 *** ...which a handful of tests do, frequently w/o any clear pattern or 
obvious reasons (or comments) and usually using values _lower_ than the default 
behavior of the client
 *** I suspect – but have not dug deep into the archives to verify – that in 
most cases these were probably ad hoc attempts to overcome timeout related 
failures seen in the wild, but since then the "global defaults" have been 
changed to be much higher (in some cases higher than what the individual tests 
hardcode)
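
To illustrate the "soft coded" defaults mentioned above: a ${name:default} placeholder in solr.xml takes the value of the named system property if one is set, and otherwise falls back to the default after the colon. Below is a minimal sketch of that resolution logic. The {{SoftCodedDefault}} class and its {{resolve}} method are hypothetical names made up for illustration; this is not Solr's actual config parsing code.

```java
import java.util.Properties;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration only; not Solr's real implementation. */
final class SoftCodedDefault {
    // Matches ${name} or ${name:default}; group 1 is the name, group 2 the optional default.
    private static final Pattern PLACEHOLDER =
        Pattern.compile("\\$\\{([^:}]+)(?::([^}]*))?\\}");

    /** Replaces each placeholder with the property value, or its default if unset. */
    static String resolve(String text, Properties props) {
        Matcher m = PLACEHOLDER.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String name = m.group(1);
            String fallback = m.group(2); // null when no ":default" was given
            String value = props.getProperty(name, fallback);
            if (value == null) {
                throw new IllegalArgumentException("No value or default for: " + name);
            }
            m.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        // No override set: the default after the colon is used.
        System.out.println(resolve("${socketTimeout:600000}", props)); // prints 600000
        // Simulates a test calling System.setProperty("socketTimeout", "15000").
        props.setProperty("socketTimeout", "15000");
        System.out.println(resolve("${socketTimeout:600000}", props)); // prints 15000
    }
}
```

Passing {{System.getProperties()}} as the second argument would mimic the situation described above, where a test (base) class that calls {{System.setProperty(....)}} silently replaces whatever default the solr.xml file declares.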

----
w/o (or at least: "before") bikeshedding over the specific numeric values to be 
used (which for now i'll just refer to as {{$VARIABLES}}), i'd like to 
propose/discuss whether it makes sense to establish in principle that the 
following statements should hold true:
 # All SolrClient (both HTTP1 and HTTP2) implementations should use the same 
default unless the user explicitly configures a value (ie: 
{{$CLIENT_SOCKET_TIMEOUT_DEFAULT}})
 # the hardcoded and softcoded default values specified in solr.xml for 
intranode communication should be *at least* as big as the SolrClient default 
(ie: {{$INTRANODE_SOCKET_TIMEOUT_DEFAULT >= $CLIENT_SOCKET_TIMEOUT_DEFAULT}} )
 # the ~25 or so places in solr/core/src/java that construct SolrClients 
(Builders) explicitly, should either:
 ** _explicitly_ use one of the configured intranode values (ie: default to 
{{$INTRANODE_SOCKET_TIMEOUT_DEFAULT}}) if the purpose of the client is 
intranode communication (which i assume they all are in one form or another – 
the only question is which of the two configured options they should use: 
update or query?)
 ** ... otherwise: _implicitly_ use {{$CLIENT_SOCKET_TIMEOUT_DEFAULT}}
 # all SolrClient "helper" methods in the test-framework should create 
SolrClients w/timeouts *at least* as large as the client default (ie: 
{{$TEST_CLIENT_SOCKET_TIMEOUT_DEFAULT >= $CLIENT_SOCKET_TIMEOUT_DEFAULT}} ) 
unless the caller specifies an explicit value
 ** no solr test should specify an explicit value & we should deprecate the 
SolrTestCaseJ4 methods that allow an explicit value
 ** any existing solr (or third party) test that has a strong need to 
explicitly specify a value should create a SolrClient.Builder explicitly.
 # all test solr.xml files should have configured intranode values *at least* 
as large as the values included in the production solr.xml (ie: 
{{$TEST_INTRANODE_SOCKET_TIMEOUT_DEFAULT >= $INTRANODE_SOCKET_TIMEOUT_DEFAULT}} 
)
 ** except where needed for the config parsing tests, our test configs should 
not include sys prop overrides – ie: no solr test should be able to use 
{{System.setProperty(....)}} to override the configured 
{{$TEST_INTRANODE_SOCKET_TIMEOUT_DEFAULT}} values. If a value is too low for 
some tests, it's probably too low for other tests and should be increased 
across the board.

Are these principles things we can all agree on?
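
The ordering constraints in principles 2, 4 and 5 can be expressed as a small check. This is only a sketch: the {{TimeoutPrinciples}} class is hypothetical, and the sample values in {{main}} are just some of the numbers observed above (2 minutes from {{SolrClientBuilder}}, 10 minutes from the production solr.xml, 90 seconds from {{MiniSolrCloudCluster.getSolrClient()}}, 15 seconds from {{test-files/solr/solr.xml}}), not proposed values.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical holder for the four $VARIABLES; not part of Solr. */
final class TimeoutPrinciples {
    private final Duration client;        // $CLIENT_SOCKET_TIMEOUT_DEFAULT
    private final Duration intranode;     // $INTRANODE_SOCKET_TIMEOUT_DEFAULT
    private final Duration testClient;    // $TEST_CLIENT_SOCKET_TIMEOUT_DEFAULT
    private final Duration testIntranode; // $TEST_INTRANODE_SOCKET_TIMEOUT_DEFAULT

    TimeoutPrinciples(Duration client, Duration intranode,
                      Duration testClient, Duration testIntranode) {
        this.client = client;
        this.intranode = intranode;
        this.testClient = testClient;
        this.testIntranode = testIntranode;
    }

    /** Returns one entry per violated ordering; empty when all three hold. */
    List<String> violations() {
        List<String> out = new ArrayList<>();
        if (intranode.compareTo(client) < 0) {
            out.add("INTRANODE < CLIENT");
        }
        if (testClient.compareTo(client) < 0) {
            out.add("TEST_CLIENT < CLIENT");
        }
        if (testIntranode.compareTo(intranode) < 0) {
            out.add("TEST_INTRANODE < INTRANODE");
        }
        return out;
    }

    public static void main(String[] args) {
        // Sample of currently observed values (see the notes above), not a proposal.
        TimeoutPrinciples observed = new TimeoutPrinciples(
            Duration.ofMinutes(2),   // SolrClientBuilder default
            Duration.ofMinutes(10),  // production solr.xml default
            Duration.ofSeconds(90),  // MiniSolrCloudCluster.getSolrClient()
            Duration.ofSeconds(15)); // socketTimeout in test-files/solr/solr.xml
        observed.violations().forEach(System.out::println);
    }
}
```

With those sample values the check reports the two test-side orderings as violated, which is the discrepancy this issue is about.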
----
One point I'd like to proactively elaborate on...

My reasoning for calling out 
{{$TEST_CLIENT_SOCKET_TIMEOUT_DEFAULT}} and 
{{$TEST_INTRANODE_SOCKET_TIMEOUT_DEFAULT}} as concepts distinct from – 
and in my opinion, ones that should be _greater than_ – the corresponding 
hardcoded/softcoded defaults in the solrj client code & production solr.xml 
files, is based on the impression that it makes sense for these values to be 
larger when running tests. The reason being: simulating multiple solr clients & 
servers concurrently on a single laptop/jenkins-vm is intensive, and it seems 
like we should be more forgiving and willing to wait at least as long for 
these responses as we might suggest/recommend a production solr client/server 
should wait.

However i should point out that this reasoning is in direct conflict with a 
comment introduced very recently in the commonly used 
{{test-files/solr/solr.xml}} where the intranode socket timeout values are much 
lower than in the solr.xml we provide end users...
{noformat}
    <int name="socketTimeout">${socketTimeout:15000}</int>
...
    <int name="distribUpdateSoTimeout">${distribUpdateSoTimeout:30000}</int> 
<!-- We are running tests - the default should be low, not like production -->

{noformat}
[[email protected]]: you seem to be the origin of that comment, can you 
elaborate on why you feel that it makes sense that the 
{{distribUpdateSoTimeout}} in tests should be so much lower than the value in 
the solr.xml files we ship to users? (10 minutes vs 30 seconds) ... because it 
seems counterintuitive to me (for the reasons mentioned above).

> rectify discrepencies in socket (and connect) timeout values used throughout 
> the code and tests - probably helping to reduce TimeoutExceptions in tests
> -------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-13389
>                 URL: https://issues.apache.org/jira/browse/SOLR-13389
>             Project: Solr
>          Issue Type: Task
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Hoss Man
>            Assignee: Hoss Man
>            Priority: Major
>
> While looking into some jenkins test failures caused by distributed requests 
> that timeout, i realized that the "socket timeout" aka "idle timeout" aka 
> "SO_TIMEOUT" values used in various places in the code & sample configs can 
> vary significantly, and in the case of *test* configs/code can differ from 
> the default / production configs by an order of magnitude.
> I think we should consider rectifying some of the various places/ways that 
> different values are sprinkled through out the code to reduce the number of 
> (different) places we have magic constants.  I believe a large number of 
> jenkins test failures we currently see due to timeout exceptions are simply 
> because tests (or test configs) override sensible defaults w/values that are 
> too low to be useful.
> (NOTE: all of these problems / discrepancies also apply to "connect timeout" 
> which should probably be addressed at the same time, but for now i'm focusing 
> on the "socket timeout" since it seems to be the bigger problem in jenkins 
> failures -- if we reach consensus on standardizing some values across the 
> board the same approach can be made to connect timeouts at the same time)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
