Aparna Suresh created SOLR-17106:
------------------------------------
Summary: LBSolrClient: Make it configurable to remove zombie ping
checks
Key: SOLR-17106
URL: https://issues.apache.org/jira/browse/SOLR-17106
Project: Solr
Issue Type: Improvement
Security Level: Public (Default Security Level. Issues are Public)
Reporter: Aparna Suresh
Following up on a dev list discussion:
https://lists.apache.org/thread/f0zfmpg0t48xrtppyfsmfc5ltzsq2qqh
The issue involves scalability challenges in SolrJ's *LBSolrClient* when a node
with numerous cores experiences connectivity problems. The "zombie" tracking
mechanism operates on a per-core basis, so it becomes a bottleneck during
distributed search on a large multi-shard collection: threads that keep
trying to reach the unhealthy cores add a high computational load and cause
performance issues.
As suggested by Chris Hostetter, LBSolrClient could be made configurable to
disable zombie "ping" checks while retaining zombie tracking. Once a server is
identified as a zombie, it would be held in zombie jail for X seconds and then
released, in the hope that by then either ZK has been updated to mark the
server DOWN, so that CloudSolrClient avoids querying it, or the pod is back up.
In either case, a single failed query would be enough to send the server back
to zombie jail.
This change has two benefits:
* Eliminate the zombie ping requests, which would otherwise overload pod(s)
coming up after a restart
* Avoid memory leaks when a node/replica goes away permanently but remains a
zombie forever, with a background thread in LBSolrClient constantly pinging
it
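
The time-based zombie jail proposed above could be sketched roughly as follows.
This is only an illustration of the idea, not LBSolrClient code: the class
ZombieJail, its methods, and the injected clock are all hypothetical names
chosen for this example. A failed server is recorded with a release time, no
ping is ever sent, and the entry simply expires.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;

/** Hypothetical sketch of a ping-free zombie jail; not part of SolrJ. */
final class ZombieJail {
    // Server base URL -> clock value (nanoseconds) at which it is released.
    private final Map<String, Long> releaseAtNanos = new ConcurrentHashMap<>();
    private final long jailNanos;
    private final LongSupplier clock; // injected so the sketch can be tested

    ZombieJail(long jailNanos, LongSupplier clock) {
        this.jailNanos = jailNanos;
        this.clock = clock;
    }

    /** Called after a failed request: (re)jail the server for the full period. */
    void markFailed(String baseUrl) {
        long now = clock.getAsLong();
        // Drop expired entries so servers that are never queried again
        // do not accumulate in the map.
        releaseAtNanos.values().removeIf(releaseAt -> now - releaseAt >= 0);
        releaseAtNanos.put(baseUrl, now + jailNanos);
    }

    /** True while the server should be skipped; expired entries are released. */
    boolean isJailed(String baseUrl) {
        Long releaseAt = releaseAtNanos.get(baseUrl);
        if (releaseAt == null) {
            return false;
        }
        if (clock.getAsLong() - releaseAt >= 0) {
            // Jail time is over: release without pinging the server.
            releaseAtNanos.remove(baseUrl, releaseAt);
            return false;
        }
        return true;
    }

    /** Number of servers currently tracked (including not-yet-purged ones). */
    int size() {
        return releaseAtNanos.size();
    }
}
```

In real use the jail would presumably be built with something like
new ZombieJail(TimeUnit.SECONDS.toNanos(x), System::nanoTime), where x is the
configurable jail time. A released server is tried again by ordinary queries,
and one failure calls markFailed and jails it again, so no background thread
is needed.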
--
This message was sent by Atlassian Jira
(v8.20.10#820010)