[
https://issues.apache.org/jira/browse/SOLR-13389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16814714#comment-16814714
]
Hoss Man commented on SOLR-13389:
---------------------------------
First off, some notes on what I found when I went digging...
* {{SolrClientBuilder}} has a hardcoded default value of {{protected Integer
socketTimeoutMillis = 120000}} ... aka: *2 minutes*
** this serves as the default value for all of the SolrClient impls with
Builders that subclass it: HttpSolrClient, ConcurrentUpdateSolrClient,
LBHttpSolrClient, CloudSolrClient.
** *BUT* approximately half of the places in solr/core that use a
SolrClient.Builder, override this default via {{withSocketTimeout(...)}}
*** in some cases hardcoded values are used that vary from as little as *1
second* (RecoveryStrategy) to as much as *2 minutes*
(OverseerCollectionMessageHandler, SyncStrategy & IndexFetcher)
* {{HttpClientUtil}} defines a {{public static final int DEFAULT_SO_TIMEOUT =
600000}} ... aka: *10 minutes*
** this constant is used in a few {{HttpClientUtil}} helper methods:
*** hardcoded value in {{HttpClientUtil.createDefaultRequestConfigBuilder()}}
*** default when using {{HttpClientUtil.createClient(SolrParams...)}}
** this constant is also used as the default value for {{Http2SolrClient}}
** This effective value (600000ms) is also specified as the "soft coded"
default in the solr.xml for both intranode related options:
*** {{<shardHandlerFactory>}}: {{<int
name="socketTimeout">${socketTimeout:600000}</int>}}
*** {{<solrcloud>}}: {{<int
name="distribUpdateSoTimeout">${distribUpdateSoTimeout:600000}</int>}}
*** this constant is also used as the "hard coded" default in the
corresponding java code if solr.xml doesn't contain those optional settings
* In our _test_ solr.xml files, the "soft coded" default
socketTimeout/distribUpdateSoTimeout values vary significantly from the version
of solr.xml we give to users...
** {{test-files/solr/solr.xml}}
*** {{socketTimeout}} = *15 seconds*
*** {{distribUpdateSoTimeout}} = *30 seconds*
*** NOTE: this is the file used by most "legacy" distributed/zk tests
** {{test-files/solr/solr-solrreporter.xml}} & {{.../solr-jmxreporter.xml}} &
{{.../solr-trackingshardhandler.xml}} & {{MiniSolrCloudCluster}} 's inline
{{DEFAULT_CLOUD_SOLR_XML}}
*** {{socketTimeout}} = *90 seconds*
*** {{distribUpdateSoTimeout}} = *340 seconds* (5 minutes 40 seconds ?!?!)
** {{test-files/solr/solr-50-all.xml}} ... only used for config parsing
tests...
*** {{socketTimeout}} = *100 ms*
*** {{distribUpdateSoTimeout}} = *33 ms*
** {{test-files/solr/solr-stress-new.xml}} ... only used for some Zk file
management tests??? ...
*** {{socketTimeout}} = *90 seconds*
*** {{distribUpdateSoTimeout}} ... left to default
** In ~10 test (base) classes these "soft coded" defaults are overridden via
{{System.setProperty(....)}} using values varying from as little as *3 seconds*
(PeerSyncReplicationTest) to a max of *90 seconds* (HdfsTestUtil)
*** I suspect – but have not dug deep into the archives to verify – that in
most cases these were probably ad hoc attempts to overcome timeout related
failures seen in the wild, but since then the "global defaults" have been
changed to be much higher (in some cases higher than what the individual tests
hardcode)
* In our test _code_, tests that use {{SolrClients}} to communicate with solr
use a variety of socket timeouts...
** a handful of tests construct the SolrClient (Builders) themselves, and use
values ranging from *30 seconds* to *90 seconds*
** tests using {{MiniSolrCloudCluster.getSolrClient()}} get a hardcoded *90
seconds*
** tests using the various
{{SolrTestCaseJ4.get(Http|LBHttp|Cloud|ConcurrentUpdate|CloudHttp2)SolrClient(...)}}
methods get the default from the corresponding Builder ... unless they
override it...
*** ...which a handful of tests do, frequently w/o any clear pattern or
obvious reasons (or comments) and usually using values _lower_ than the default
behavior of the client
*** I suspect – but have not dug deep into the archives to verify – that in
most cases these were probably ad hoc attempts to overcome timeout related
failures seen in the wild, but since then the "global defaults" have been
changed to be much higher (in some cases higher than what the individual tests
hardcode)
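To make the "soft coded" vs "hard coded" distinction above concrete, here is a minimal sketch of how a placeholder like the {{socketTimeout}} one in solr.xml could resolve: a property wins if it is set, otherwise the default embedded in the placeholder applies. The {{SoftCodedDefault}} class and its {{resolveInt}} method are hypothetical names made up for illustration; this is not Solr's actual property substitution code, and a plain {{Properties}} object stands in for the system properties.

```java
import java.util.Properties;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration only; not Solr's implementation. */
final class SoftCodedDefault {
    // Matches a whole value of the form ${propertyName:defaultValue}
    private static final Pattern PLACEHOLDER =
        Pattern.compile("\\$\\{([^:}]+):([^}]*)\\}");

    /** Returns the property value if set, otherwise the default embedded in the placeholder. */
    static int resolveInt(String raw, Properties props) {
        String value = raw.trim();
        Matcher m = PLACEHOLDER.matcher(value);
        if (!m.matches()) {
            return Integer.parseInt(value); // plain literal, no placeholder
        }
        String override = props.getProperty(m.group(1));
        return Integer.parseInt(override != null ? override : m.group(2));
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        // Nothing set: the "soft coded" default from the config applies.
        System.out.println(resolveInt("${socketTimeout:600000}", props));
        // A test that sets the property replaces that default.
        props.setProperty("socketTimeout", "15000");
        System.out.println(resolveInt("${socketTimeout:600000}", props));
    }
}
```

With nothing set this resolves to 600000; once the property is set to 15000 the same config line resolves to 15000, which is how a single test (base) class can end up running with a far lower timeout than the config suggests.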
----
w/o (or at least: "before") bikeshedding over the specific numeric values to be
used (which for now i'll just refer to as {{$VARIABLES}}), i'd like to
propose/discuss whether it makes sense to establish in principle that the
following statements should hold true:
# All SolrClient (both HTTP1 and HTTP2) implementations should use the same
default unless the user explicitly configures one (ie:
{{$CLIENT_SOCKET_TIMEOUT_DEFAULT}})
# the hardcoded and softcoded default values specified in solr.xml for
intranode communication should be *at least* as big as the SolrClient default
(ie: {{$INTRANODE_SOCKET_TIMEOUT_DEFAULT >= $CLIENT_SOCKET_TIMEOUT_DEFAULT}} )
# the ~25 or so places in solr/core/src/java that construct SolrClients
(Builders) explicitly, should either:
** _explicitly_ use one of the configured intranode values (ie: default to
{{$INTRANODE_SOCKET_TIMEOUT_DEFAULT}}) if the purpose of the client is
intranode communication (which i assume they all are in one form or another –
the only question is which of the two configured options they should use:
update or query?)
** ... otherwise: _implicitly_ use {{$CLIENT_SOCKET_TIMEOUT_DEFAULT}}
# all SolrClient "helper" methods in the test-framework should create
SolrClients w/timeouts *at least* as large as the client default (ie:
{{$TEST_CLIENT_SOCKET_TIMEOUT_DEFAULT >= $CLIENT_SOCKET_TIMEOUT_DEFAULT}} )
unless the caller specifies an explicit value
** no solr test should specify an explicit value & we should deprecate the
SolrTestCaseJ4 methods that allow an explicit value
** any existing solr (or third party) test that has a strong need to
explicitly specify a value should create a SolrClient.Builder explicitly.
# all test solr.xml files should have configured intranode values *at least*
as large as the values included in the production solr.xml (ie:
{{$TEST_INTRANODE_SOCKET_TIMEOUT_DEFAULT >= $INTRANODE_SOCKET_TIMEOUT_DEFAULT}}
)
** except where needed for the config parsing tests, our test configs should
not include sys prop overrides – ie: no solr test should be able to use
{{System.setProperty(....)}} to override the configured
{{$TEST_INTRANODE_SOCKET_TIMEOUT_DEFAULT}} value. If it's too low for some
tests, it's probably too low for other tests and should be increased across the
board.
Are these principles things we can all agree on?
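As a rough sketch, the orderings in the principles above could be checked mechanically along these lines. The {{TimeoutInvariants}} class is hypothetical (nothing like it exists in the codebase as far as this comment goes), and the numbers plugged in are just a few of the values from the survey above, picked for illustration rather than as a complete picture.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical holder for the proposed defaults (all values in milliseconds). */
final class TimeoutInvariants {
    final int client;        // $CLIENT_SOCKET_TIMEOUT_DEFAULT
    final int intranode;     // $INTRANODE_SOCKET_TIMEOUT_DEFAULT
    final int testClient;    // $TEST_CLIENT_SOCKET_TIMEOUT_DEFAULT
    final int testIntranode; // $TEST_INTRANODE_SOCKET_TIMEOUT_DEFAULT

    TimeoutInvariants(int client, int intranode, int testClient, int testIntranode) {
        this.client = client;
        this.intranode = intranode;
        this.testClient = testClient;
        this.testIntranode = testIntranode;
    }

    /** Returns the orderings that are violated; empty if all of them hold. */
    List<String> violations() {
        List<String> out = new ArrayList<>();
        if (intranode < client) out.add("INTRANODE < CLIENT");
        if (testClient < client) out.add("TEST_CLIENT < CLIENT");
        if (testIntranode < intranode) out.add("TEST_INTRANODE < INTRANODE");
        return out;
    }

    public static void main(String[] args) {
        // Illustrative values from the survey: 120000 (SolrClientBuilder),
        // 600000 (production solr.xml), 90000 (MiniSolrCloudCluster.getSolrClient()),
        // 15000 (socketTimeout in test-files/solr/solr.xml).
        TimeoutInvariants current =
            new TimeoutInvariants(120_000, 600_000, 90_000, 15_000);
        System.out.println(current.violations());
    }
}
```

With those particular values, the production ordering holds (600000 >= 120000) but both test-side orderings are violated, which is the situation the proposal is meant to rule out.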
----
One point I'd like to proactively elaborate on...
My reasoning for calling out the idea of
{{$TEST_CLIENT_SOCKET_TIMEOUT_DEFAULT}} and
{{$TEST_INTRANODE_SOCKET_TIMEOUT_DEFAULT}} concepts as being distinct from –
and in my opinion, should be _greater than_ – the corresponding
hardcoded/softcoded defaults in the solrj client code & production solr.xml
files, is based on the impression that it makes sense for these values to be
larger when running tests. The reason being: simulating multiple solr clients &
servers concurrently on a single laptop/jenkins-vm is intensive, and it seems
like we should be more forgiving and willing to wait at least as long for
these responses as we might suggest/recommend a production solr client/server
should wait.
However i should point out that this reasoning is in direct conflict with a
comment introduced very recently in the commonly used
{{test-files/solr/solr.xml}} where the intranode socket timeout values are much
lower than in the solr.xml we provide end users...
{noformat}
<int name="socketTimeout">${socketTimeout:15000}</int>
...
<int name="distribUpdateSoTimeout">${distribUpdateSoTimeout:30000}</int>
<!-- We are running tests - the default should be low, not like production -->
{noformat}
[[email protected]]: you seem to be the origin of that comment, can you
elaborate on why you feel that it makes sense that the
{{distribUpdateSoTimeout}} in tests should be so much lower than the value in
the solr.xml files we ship to users? (10 minutes vs 30 seconds) ... because it
seems counterintuitive to me (for the reasons mentioned above).
> rectify discrepencies in socket (and connect) timeout values used throughout
> the code and tests - probably helping to reduce TimeoutExceptions in tests
> -------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: SOLR-13389
> URL: https://issues.apache.org/jira/browse/SOLR-13389
> Project: Solr
> Issue Type: Task
> Security Level: Public(Default Security Level. Issues are Public)
> Reporter: Hoss Man
> Assignee: Hoss Man
> Priority: Major
>
> While looking into some jenkins test failures caused by distributed requests
> that timeout, i realized that the "socket timeout" aka "idle timeout" aka
> "SO_TIMEOUT" values used in various places in the code & sample configs can
> vary significantly, and in the case of *test* configs/code can differ from
> the default / production configs by an order of magnitude.
> I think we should consider rectifying some of the various places/ways that
> different values are sprinkled throughout the code to reduce the number of
> (different) places we have magic constants. I believe a large number of
> jenkins test failures we currently see due to timeout exceptions are simply
> because tests (or test configs) override sensible defaults w/values that are
> too low to be useful.
> (NOTE: all of these problems / discrepancies also apply to "connect timeout"
> which should probably be addressed at the same time, but for now i'm focusing
> on the "socket timeout" since it seems to be the bigger problem in jenkins
> failures -- if we reach consensus on standardizing some values across the
> board the same approach can be made to connect timeouts at the same time)