[
https://issues.apache.org/jira/browse/SOLR-17106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17824591#comment-17824591
]
Chris M. Hostetter commented on SOLR-17106:
-------------------------------------------
Hi Aparna,
This looks pretty good ... but it still seems like a few things have slipped
through the cracks?
* In the {{Builder}} classes, {{aliveCheckQuery}} needs to to default to the
existing behavior which is currently in {{LBSolrClient}} 's {{private static
final SolrQuery solrQuery}}
** I would rename/refactor that to something like {{protected static final
SolrQuery DEFAULT_ALIVE_CHECK_QUERY}} and re-use it when initializing
{{aliveCheckQuery}} in both {{Builder}} classes
** IMO we should change the default to {{null}} on the {{main}} branch, but
first we need a patch that is cleanly backcompatible to commit & backport to
9x, then we can discuss changing the defaults on it's own merits
* In the actual {{LBSolrClient}} subclasses, the {{aliveCheckSkipIters}} and
{{aliveCheckQuery}} variables need to be final
* In {{checkAZombieServer}} you are inspection & decrementing
{{zombieServer.skipAliveCheckIters}} before doing the ping check – but in the
event that the ping fails, you are never re-setting the value of
{{zombieServer.skipAliveCheckIters}}
** So instead of only pinging ever Nth iter, it will ping on every iter
starting with the Nth iter
** {{handleServerDown}} should reset {{zombieServer.skipAliveCheckIters =
this.aliveCheckSkipIters;}}
* The new private {{pingServer}} and {{isServerAlive}} methods you added feel
like they are too intertwined to make sense as individual methods?
** Both have to explicitly check if {{aliveCheckQuery}} is null
** The return value of {{pingServer}} is only used by {{isServerAlive}} and no
where else.
** I think it would make more sense to just merge these two methods into...
{code:java}
private boolean isServerAlive(ServerWrapper zombieServer) throws
SolrServerException, IOException {
if (null == aliveCheckQuery) {
return true;
}
QueryRequest queryRequest = new QueryRequest(aliveCheckQuery);
queryRequest.setBasePath(zombieServer.baseUrl);
return
queryRequest.process(getClient(zombieServer.getBaseUrl())).getStatus() == 0;
}
{code}
...those are my functionality concerns, i also have some nit-picky concerns...
* Severally places in your patch modify lines in ways that only change
formatting
** If you run {{gradle tidy}} before committing to your branch these should be
automatically cleaned up
* It looks like you also added an unused import of
{{io.swagger.v3.oas.annotations.servers.Server}} ?
** {{gradle tidy}} (or maybe it's {{gradle check}} ?) should also catch this.
* I it would be good to have some {{@see}} tags or {{@link}} tags in the
javadocs of {{setAliveCheckSkipIters}} , {{{}setAliveCheckInterval{}}}, and
{{setAliveCheckQuery}} explaining how they all relate to each other.
* {{testReliability()}} is already not a very good test – it's certainly not
worth copy/pasting exactly as is just to change the client slightly
** It would be cleaner to just refactor it's body into a helper method (ie:
{{protected doTestReliability(LBSolrClient)}} ) that you call from both
{{testReliabilityWithLivenessChecks()}} and your new
{{testReliabilityWithMinimumZombieTimeAndDisabledQueries()}}
** Better still: we can make {{{}checkAZombieServer{}}}, {{isServerAlive}} do
some {{DEBUG}} level logging about the particulars of what it's doing (when it
skips a server because of the {{{}skipAliveCheckIters{}}}, when it assumes
success because {{aliveCheckQuery}} is null, when it gets a success or failure
result from a server, etc...) and then make tose tests use the {{@LogLevel}}
annotation along with the {{LogListener}} helper class (restricted to
substrings matching the name of the server we stop/restart) to assert it got
the expected log messages (we just can't be too picky about the number of times
those messages are logged)
> LBSolrClient: Make it configurable to remove zombie ping checks
> ---------------------------------------------------------------
>
> Key: SOLR-17106
> URL: https://issues.apache.org/jira/browse/SOLR-17106
> Project: Solr
> Issue Type: Improvement
> Reporter: Aparna Suresh
> Priority: Minor
> Time Spent: 10m
> Remaining Estimate: 0h
>
> Following discussion from a dev list discussion here:
> [https://lists.apache.org/thread/f0zfmpg0t48xrtppyfsmfc5ltzsq2qqh]
> The issue involves scalability challenges in SolrJ's *LBSolrClient* when a
> pod with numerous cores experiences connectivity problems. The "zombie"
> tracking mechanism, operating on a core basis, becomes a bottleneck during
> distributed search on a massive multi shard collection. Threads attempting to
> reach unhealthy cores contribute to a high computational load, causing
> performance issues.
> As suggested by Chris Hostetter: LBSolrClient could be configured to disable
> zombie "ping" checks, but retain zombie tracking. Once a replica/endpoint is
> identified as a zombie, it could be held in zombie jail for X seconds, before
> being released - hoping that by this timeframe ZK would be updated to mark
> this endpoint DOWN or the pod is back up and CloudSolrClient would avoid
> querying it. In any event, only 1 failed query would be needed to send the
> server back to zombie jail.
>
> There are benefits in doing this change:
> * Eliminate the zombie ping requests, which would otherwise overload pod(s)
> coming up after a restart
> * Avoid memory leaks, in case a node/replica goes away permanently, but it
> stays as zombie forever, with a background thread in LBSolrClient constantly
> pinging it
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]