Suril Shah created SOLR-13532:
---------------------------------
Summary: Unable to start core recovery due to timeout in ping request
Key: SOLR-13532
URL: https://issues.apache.org/jira/browse/SOLR-13532
Project: Solr
Issue Type: Bug
Security Level: Public (Default Security Level. Issues are Public)
Components: SolrCloud
Affects Versions: 7.6
Reporter: Suril Shah
Discovered the following issue with core recovery:
* Core recovery is not initiated, and the following message is logged:
{code:java}
2019-06-07 00:53:12.436 INFO
(recoveryExecutor-4-thread-1-processing-n:<solr_ip>:8983_solr
x:<collection_name>_shard41_replica_n2777 c:<collection_name> s:shard41
r:core_node2778) x:<collection_name>_shard41_replica_n2777
o.a.s.c.RecoveryStrategy Failed to connect leader http://<solr_ip>:8983/solr on
recovery, try again{code}
* The above error occurs when the ping request takes longer than the timeout,
which is hard-coded to one second in the Solr source code. However, in a
typical production setting it is common for a ping to take more than one
second, so core recovery never starts and the exception is thrown.
* The other major concern is that this exception is logged as an info message,
so it is very difficult to identify the error if info logging is not enabled.
* Please refer to the following code snippet from the [source
code|https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/cloud/RecoveryStrategy.java#L789-L803]
to understand the above issue.
{code:java}
try (HttpSolrClient httpSolrClient = new HttpSolrClient.Builder(leaderReplica.getCoreUrl())
    .withSocketTimeout(1000)
    .withConnectionTimeout(1000)
    .withHttpClient(cc.getUpdateShardHandler().getRecoveryOnlyHttpClient())
    .build()) {
  SolrPingResponse resp = httpSolrClient.ping();
  return leaderReplica;
} catch (IOException e) {
  log.info("Failed to connect leader {} on recovery, try again", leaderReplica.getBaseUrl());
  Thread.sleep(500);
} catch (Exception e) {
  if (e.getCause() instanceof IOException) {
    log.info("Failed to connect leader {} on recovery, try again", leaderReplica.getBaseUrl());
    Thread.sleep(500);
  }
{code}
The above issue can have a high impact on production clusters, since cores
that are unable to recover may lead to data loss.
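To illustrate the failure mode, here is a minimal standalone simulation; it is not Solr code. The class {{PingTimeoutDemo}} and its method are hypothetical, and a sleeping task stands in for a slow leader. It only shows that when every ping takes longer than a fixed timeout, retrying with the same timeout cannot succeed.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical simulation of a ping guarded by a fixed timeout; not part of Solr. */
final class PingTimeoutDemo {

    /** Counts how many of maxAttempts simulated pings finish within timeoutMs. */
    static int successfulPings(long pingDurationMs, long timeoutMs, int maxAttempts) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        int successes = 0;
        try {
            for (int attempt = 0; attempt < maxAttempts; attempt++) {
                // The sleeping task stands in for a leader that answers slowly.
                Future<Object> ping = pool.submit(() -> {
                    Thread.sleep(pingDurationMs);
                    return null;
                });
                try {
                    ping.get(timeoutMs, TimeUnit.MILLISECONDS);
                    successes++;
                } catch (TimeoutException e) {
                    // Give up on this attempt, as a socket timeout would.
                    ping.cancel(true);
                } catch (InterruptedException | ExecutionException e) {
                    throw new IllegalStateException("simulated ping failed unexpectedly", e);
                }
            }
        } finally {
            pool.shutdownNow();
        }
        return successes;
    }

    public static void main(String[] args) {
        // Durations are scaled down by 10x so the demo runs quickly.
        System.out.println("timeout 100 ms, ping 150 ms: "
            + successfulPings(150, 100, 3) + " of 3 pings succeeded");
        System.out.println("timeout 500 ms, ping 150 ms: "
            + successfulPings(150, 500, 3) + " of 3 pings succeeded");
    }
}
```

The only difference between the two runs is the timeout value, which is the parameter this issue asks to make configurable.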
The following improvements would be really helpful:
1. The [timeout for ping
request|https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/cloud/RecoveryStrategy.java#L790-L791]
in *RecoveryStrategy.java* should be configurable.
2. The log messages at [line
797|https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/cloud/RecoveryStrategy.java#L797]
and [line
801|https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/cloud/RecoveryStrategy.java#L801]
in *RecoveryStrategy.java* should be logged as *error* messages instead of
*info* messages.
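
The two improvements could look roughly like the sketch below. This is only an illustration under stated assumptions, not a proposed patch: the helper class {{RecoveryPingSettings}} and the system property name {{solr.recovery.pingTimeoutMs}} are hypothetical and do not exist in Solr, and {{java.util.logging}} is used here only to keep the example self-contained.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical helper, not part of Solr. */
final class RecoveryPingSettings {

    // Assumed property name, for illustration only; Solr does not define it.
    static final String TIMEOUT_PROPERTY = "solr.recovery.pingTimeoutMs";

    // Same value that is currently hard-coded in RecoveryStrategy.
    static final int DEFAULT_TIMEOUT_MS = 1000;

    private static final Logger LOG =
        Logger.getLogger(RecoveryPingSettings.class.getName());

    /** Returns the configured timeout in ms, or the default if unset or invalid. */
    static int pingTimeoutMs() {
        // Integer.getInteger returns null when the property is missing or not numeric.
        Integer configured = Integer.getInteger(TIMEOUT_PROPERTY);
        if (configured == null || configured <= 0) {
            return DEFAULT_TIMEOUT_MS;
        }
        return configured;
    }

    /** Logs a failed leader ping at error (SEVERE) level instead of info. */
    static void logPingFailure(String leaderBaseUrl, Exception cause) {
        LOG.log(Level.SEVERE,
            "Failed to connect leader " + leaderBaseUrl + " on recovery, try again",
            cause);
    }

    public static void main(String[] args) {
        System.out.println("ping timeout (ms): " + pingTimeoutMs());
    }
}
```

With something like this in place, the builder calls in the snippet above could pass {{RecoveryPingSettings.pingTimeoutMs()}} to {{withSocketTimeout}} and {{withConnectionTimeout}} instead of the literal 1000, and the JVM could be started with, for example, {{-Dsolr.recovery.pingTimeoutMs=5000}}.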