[
https://issues.apache.org/jira/browse/SOLR-13028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16710392#comment-16710392
]
Hoss Man commented on SOLR-13028:
---------------------------------
So where exactly is the bug? Are there multiple bugs?
* All the code in question running on the solr nodes is using the
CloudSolrClient that comes from the SolrClientCloudManager (from the
ZkController) ... since every solr node has ZK watches for the live nodes (and
state.json for each collection), shouldn't each ZkController be able to purge
its CloudSolrClient's connection pool once we know a node has gone down?
* Sprinkled throughout the solr code base, we have a lot of places where
{{NoHttpResponseException}} is checked for explicitly and considered a
"communication error" that can be retried ... is that appropriate here in the
autoscaling Policy/Snitch logic?
** Other methods in SolrClientNodeStateProvider explicitly check for
{{cause instanceof NoHttpResponseException}} when they encounter an exception,
and retry in that case, but SolrClientNodeStateProvider's static inner class
AutoScalingSnitch doesn't – it has nearly cut/pasted code that only looks for
SocketException ... should this code path be retrying on
NoHttpResponseException?
** Ironically, the error it throws if it retries on SocketException multiple
times w/o succeeding misleadingly claims it encountered a
NoHttpResponseException – suggesting blatant cut/paste abuse...
{code:java}
while (cnt++ < retries) {
  try {
    rsp = snitchContext.invoke(solrNode, CommonParams.METRICS_PATH, params);
  } catch (SolrException | SolrServerException | SocketException e) {
    boolean hasCauseSocketException = false;
    Throwable cause = e;
    while (cause != null) {
      if (cause instanceof SocketException) {
        hasCauseSocketException = true;
        break;
      }
      cause = cause.getCause();
    }
    if (hasCauseSocketException || e instanceof SocketException) {
      log.info("Error on getting remote info, trying again: " + e.getMessage());
      Thread.sleep(500);
      continue;
    } else {
      throw e;
    }
  }
}
if (cnt == retries) {
  throw new SolrException(ErrorCode.SERVER_ERROR, "Could not get remote info after many retries on NoHttpResponseException");
}
{code}
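To make the question concrete, here is a minimal sketch of what a single shared "is this retryable?" check might look like, so that the different code paths could not drift apart and the final error message would report what was actually seen. Everything here is hypothetical and not existing Solr code: the class {{RetryOnCause}}, the methods {{hasCause}} and {{callWithRetries}}, and the parameter names are made up for illustration. It uses only JDK types so that it stands alone; in Solr the list of retryable types would presumably include both SocketException and NoHttpResponseException, if retrying on the latter is in fact appropriate here.

```java
import java.net.SocketException;
import java.util.List;
import java.util.concurrent.Callable;

/** Hypothetical helper for illustration only; not part of Solr. */
final class RetryOnCause {

  /** Returns true if t, or anything in its cause chain, is an instance of one of the given types. */
  public static boolean hasCause(Throwable t, List<Class<? extends Throwable>> retryable) {
    for (Throwable cause = t; cause != null; cause = cause.getCause()) {
      for (Class<? extends Throwable> type : retryable) {
        if (type.isInstance(cause)) {
          return true;
        }
      }
    }
    return false;
  }

  /**
   * Invokes call up to 'retries' times. Exceptions whose cause chain contains a
   * retryable type trigger another attempt; anything else is rethrown at once.
   * When all attempts fail, the thrown error carries the last exception seen
   * instead of a hard-coded exception name.
   */
  public static <T> T callWithRetries(Callable<T> call, int retries, long sleepMs,
                                      List<Class<? extends Throwable>> retryable) throws Exception {
    Exception last = null;
    for (int attempt = 1; attempt <= retries; attempt++) {
      try {
        return call.call(); // success: return immediately
      } catch (Exception e) {
        if (!hasCause(e, retryable)) {
          throw e; // not a communication error we know how to retry
        }
        last = e;
        if (attempt < retries) {
          Thread.sleep(sleepMs); // back off before the next attempt
        }
      }
    }
    throw new IllegalStateException(
        "Could not get remote info after " + retries + " attempts; last error: " + last, last);
  }

  public static void main(String[] args) throws Exception {
    List<Class<? extends Throwable>> retryable = List.of(SocketException.class);
    int[] calls = {0};
    // Simulated remote call: fails twice with a wrapped SocketException, then succeeds.
    String rsp = callWithRetries(() -> {
      if (calls[0]++ < 2) {
        throw new RuntimeException(new SocketException("connection reset"));
      }
      return "metrics";
    }, 3, 10L, retryable);
    System.out.println(rsp + " after " + calls[0] + " calls");
  }
}
```

With something along these lines, the quoted loop would reduce to one call that wraps {{snitchContext.invoke(...)}}, and whether NoHttpResponseException belongs in the retryable list would be decided in one place.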
> Harden AutoAddReplicasPlanActionTest#testSimple
> -----------------------------------------------
>
> Key: SOLR-13028
> URL: https://issues.apache.org/jira/browse/SOLR-13028
> Project: Solr
> Issue Type: Sub-task
> Security Level: Public(Default Security Level. Issues are Public)
> Reporter: Mark Miller
> Priority: Major
> Attachments: sarowe__Lucene-Solr-BadApple-tests-master__229.log.txt
>
>