[
https://issues.apache.org/jira/browse/SOLR-17106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Aparna Suresh updated SOLR-17106:
---------------------------------
Description:
Following up on a dev list discussion here:
[https://lists.apache.org/thread/f0zfmpg0t48xrtppyfsmfc5ltzsq2qqh]
The issue involves scalability challenges in SolrJ's *LBSolrClient* when a pod
with numerous cores experiences connectivity problems. The "zombie" tracking
mechanism, which operates on a per-core basis, becomes a bottleneck during
distributed search on a massive multi-shard collection. Threads attempting to
reach unhealthy cores contribute to a high computational load, causing
performance issues.
As suggested by Chris Hostetter: LBSolrClient could be configured to disable
zombie "ping" checks, but retain zombie tracking. Once a replica/endpoint is
identified as a zombie, it could be held in zombie jail for X seconds before
being released, in the hope that by then either ZK has been updated to mark
this endpoint DOWN (so that CloudSolrClient would avoid querying it) or the pod
is back up. In any event, only 1 failed query would be needed to send the
server back to zombie jail.
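To make the proposal concrete, here is a minimal sketch of a timed zombie jail
with no ping thread. It is an illustration only, not existing SolrJ code: the
class name ZombieJail, its methods, and the injected clock are all hypothetical,
and the jail duration stands in for the "X seconds" above.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;

/** Hypothetical timed "zombie jail"; an illustration, not part of SolrJ. */
class ZombieJail {
  // endpoint URL -> time (ms) at which the endpoint is released from jail
  private final Map<String, Long> releaseAtMs = new ConcurrentHashMap<>();
  private final long jailMs;          // the "X seconds" from the proposal, in ms
  private final LongSupplier clockMs; // injected so the behavior can be tested

  ZombieJail(long jailMs, LongSupplier clockMs) {
    this.jailMs = jailMs;
    this.clockMs = clockMs;
  }

  /** Call when a request to the endpoint fails: (re)jail it for jailMs. */
  void markFailed(String endpoint) {
    releaseAtMs.put(endpoint, clockMs.getAsLong() + jailMs);
  }

  /** True if the endpoint may be queried. No ping is sent; time alone frees it. */
  boolean isAvailable(String endpoint) {
    Long releaseAt = releaseAtMs.get(endpoint);
    if (releaseAt == null) {
      return true; // never jailed, or already released
    }
    if (clockMs.getAsLong() < releaseAt) {
      return false; // still in jail
    }
    releaseAtMs.remove(endpoint, releaseAt); // drop the expired entry lazily
    return true;
  }

  /** Drops every expired entry, including endpoints that are never queried again. */
  void purgeExpired() {
    long now = clockMs.getAsLong();
    releaseAtMs.values().removeIf(releaseAt -> now >= releaseAt);
  }

  /** Number of endpoints currently tracked (jailed or awaiting cleanup). */
  int jailedCount() {
    return releaseAtMs.size();
  }
}
```

In this sketch a caller would check isAvailable(endpoint) before sending a
request and call markFailed(endpoint) on failure, so a released endpoint that is
still unhealthy goes back to jail after a single failed query. Because entries
expire on their own, an endpoint that disappears permanently does not stay
tracked forever.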
This change has two benefits:
* Eliminate the zombie ping requests, which would otherwise overload pod(s)
coming up after a restart
* Avoid memory leaks when a node/replica goes away permanently but stays a
zombie forever, with a background thread in LBSolrClient constantly pinging it
was:
Following discussion from a dev list discussion here:
https://lists.apache.org/thread/f0zfmpg0t48xrtppyfsmfc5ltzsq2qqh
The issue involves scalability challenges in SolrJ's *LBSolrClient* when a node
with numerous cores experiences connectivity problems. The "zombie" tracking
mechanism, operating on a core basis, becomes a bottleneck during distributed
search on a massive multi shard collection. Threads attempting to reach
unhealthy cores contribute to a high computational load, causing performance
issues.
As suggested by Chris Hostetter: LBSolrClient could be configured to disable
zombie "ping" checks, but retain zombie tracking. Once a server is identified
as a zombie, it could be held in zombie jail for X seconds, before being
released - hoping that by this timeframe ZK would be updated to mark this
server DOWN or the pod is back up and CloudSolrClient would avoid querying it.
In any event, only 1 failed query would be needed to send the server back to
zombie jail.
There are benefits in doing this change:
* Eliminate the zombie ping requests, which would otherwise overload pod(s)
coming up after a restart
* Avoid memory leaks, in case a node/replica goes away permanently, but it
stays as zombie forever, with a background thread in LBSolrClient constantly
pinging it
> LBSolrClient: Make it configurable to remove zombie ping checks
> ---------------------------------------------------------------
>
> Key: SOLR-17106
> URL: https://issues.apache.org/jira/browse/SOLR-17106
> Project: Solr
> Issue Type: Improvement
> Security Level: Public(Default Security Level. Issues are Public)
> Reporter: Aparna Suresh
> Priority: Minor
>
> Following up on a dev list discussion here:
> [https://lists.apache.org/thread/f0zfmpg0t48xrtppyfsmfc5ltzsq2qqh]
> The issue involves scalability challenges in SolrJ's *LBSolrClient* when a
> pod with numerous cores experiences connectivity problems. The "zombie"
> tracking mechanism, which operates on a per-core basis, becomes a bottleneck
> during distributed search on a massive multi-shard collection. Threads
> attempting to reach unhealthy cores contribute to a high computational load,
> causing performance issues.
> As suggested by Chris Hostetter: LBSolrClient could be configured to disable
> zombie "ping" checks, but retain zombie tracking. Once a replica/endpoint is
> identified as a zombie, it could be held in zombie jail for X seconds before
> being released, in the hope that by then either ZK has been updated to mark
> this endpoint DOWN (so that CloudSolrClient would avoid querying it) or the
> pod is back up. In any event, only 1 failed query would be needed to send the
> server back to zombie jail.
>
> This change has two benefits:
> * Eliminate the zombie ping requests, which would otherwise overload pod(s)
> coming up after a restart
> * Avoid memory leaks when a node/replica goes away permanently but stays a
> zombie forever, with a background thread in LBSolrClient constantly pinging
> it