[
https://issues.apache.org/jira/browse/SOLR-17106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17817128#comment-17817128
]
Aparna Suresh commented on SOLR-17106:
--------------------------------------
Thanks for the feedback! Sorry I did not have a chance to respond for a few
weeks - I was out sick with Covid initially and then tied up investigating
issues in Production. Appreciate the detailed evaluation.
I completely missed the point about backwards compatibility!
{quote}I'm guessing what you meant to do is have
{{reduceRemainingZombieTime(...)}} subtract
{{zombieStateMonitoringIntervalMillis}} from {{remainingTime}} ? ... but this
approach still seems kind of confusing & misleading, because tracking &
recording "remaining milliseconds" like this implies more granularity than there
really is.
{{remainingTime=10 (ms)}} is meaningless if
{{zombieStateMonitoringIntervalMillis=60_000}} – you're going to have to wait
the full 60 seconds.
{quote}
In my first commit on LBHttp2SolrClient I overrode
zombieStateMonitoringIntervalMillis to 5s, with remainingTime set to 10s. So
that the periodically running thread doesn't evict a zombie entry right away, I
added the following if condition, but I agree that it would keep some entries
as zombies until the next run. I agree 100% that the time-based approach
doesn't provide much flexibility compared to the numIters approach.
{code:java}
private void reduceRemainingZombieTime(ServerWrapper wrapper) {
  if (wrapper == null) {
    return;
  }
  if (wrapper.remainingTime == 0) {
    // evict from zombieServers, add to aliveServers
    zombieServers.remove(wrapper.getBaseUrl());
    wrapper.failedPings = 0;
    if (wrapper.standard) {
      addToAlive(wrapper);
    }
  } else {
    wrapper.remainingTime =
        Math.max(0, wrapper.remainingTime - minZombieReleaseTimeMillis);
  }
}
{code}
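To make the granularity point concrete, here is a minimal sketch that simulates the countdown in the snippet above and counts how many monitor runs pass before an entry is released. It assumes the amount subtracted on each run ({{stepMillis}}) equals the monitor interval, as suggested in the quoted comment; the class and method names are hypothetical and not part of LBSolrClient.

```java
class ZombieCountdownSketch {
  /**
   * Counts the monitor runs needed to release an entry, including the run
   * that performs the release. Mirrors the if/else above: a run either
   * releases the entry (remaining == 0) or subtracts the step, never both.
   * stepMillis is an assumed stand-in for the per-run decrement.
   */
  static int runsUntilRelease(long remainingTimeMillis, long stepMillis) {
    if (stepMillis <= 0) {
      throw new IllegalArgumentException("stepMillis must be positive");
    }
    long remaining = Math.max(0, remainingTimeMillis);
    int runs = 1; // the run that finally sees remaining == 0 and releases
    while (remaining != 0) {
      remaining = Math.max(0, remaining - stepMillis);
      runs++; // one extra run for every decrement
    }
    return runs;
  }
}
```

With {{remainingTime=10_000}} and a 5s step this returns 3, and with {{remainingTime=10}} and a 60s step it returns 2, so the effective jail time is always a whole number of monitor runs regardless of the millisecond value stored.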
I have updated the PR based on your comments here:
[https://github.com/apache/solr/pull/2160/files]
> LBSolrClient: Make it configurable to remove zombie ping checks
> ---------------------------------------------------------------
>
> Key: SOLR-17106
> URL: https://issues.apache.org/jira/browse/SOLR-17106
> Project: Solr
> Issue Type: Improvement
> Reporter: Aparna Suresh
> Priority: Minor
> Time Spent: 10m
> Remaining Estimate: 0h
>
> Following up on a dev list discussion here:
> [https://lists.apache.org/thread/f0zfmpg0t48xrtppyfsmfc5ltzsq2qqh]
> The issue involves scalability challenges in SolrJ's *LBSolrClient* when a
> pod with numerous cores experiences connectivity problems. The "zombie"
> tracking mechanism, operating on a core basis, becomes a bottleneck during
> distributed search on a massive multi shard collection. Threads attempting to
> reach unhealthy cores contribute to a high computational load, causing
> performance issues.
> As suggested by Chris Hostetter: LBSolrClient could be configured to disable
> zombie "ping" checks, but retain zombie tracking. Once a replica/endpoint is
> identified as a zombie, it could be held in zombie jail for X seconds before
> being released, in the hope that by then either ZK has been updated to mark
> this endpoint DOWN (so CloudSolrClient would avoid querying it) or the pod is
> back up. In any event, only 1 failed query would be needed to send the
> server back to zombie jail.
>
> This change has two benefits:
> * Eliminate the zombie ping requests, which would otherwise overload pod(s)
> coming up after a restart
> * Avoid memory leaks when a node/replica goes away permanently but stays a
> zombie forever, with a background thread in LBSolrClient constantly
> pinging it
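
The "zombie jail" idea described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the LBSolrClient implementation: the class name, method names, and the caller-supplied clock value are all hypothetical. An endpoint is jailed for a fixed duration after a failed request and is released on the next lookup after that duration, with no ping requests involved.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class ZombieJailSketch {
  private final long jailMillis; // the "X seconds" from the description, in ms
  // baseUrl -> timestamp (ms) at which the endpoint may be tried again
  private final Map<String, Long> releaseAt = new ConcurrentHashMap<>();

  ZombieJailSketch(long jailMillis) {
    this.jailMillis = jailMillis;
  }

  /** A single failed request sends the endpoint (back) to jail. */
  void onRequestFailed(String baseUrl, long nowMillis) {
    releaseAt.put(baseUrl, nowMillis + jailMillis);
  }

  /** True while the endpoint should be skipped; expired entries are dropped. */
  boolean isJailed(String baseUrl, long nowMillis) {
    Long until = releaseAt.get(baseUrl);
    if (until == null) {
      return false;
    }
    if (nowMillis >= until) {
      // Sentence served: forget the entry so it cannot linger forever.
      releaseAt.remove(baseUrl, until);
      return false;
    }
    return true;
  }

  /** Number of endpoints currently tracked as zombies. */
  int size() {
    return releaseAt.size();
  }
}
```

Because an entry is removed once its time has elapsed, an endpoint that goes away permanently does not stay tracked indefinitely as long as it is looked up again; a real implementation would also need a periodic sweep for endpoints that are never queried again.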