[jira] [Commented] (SOLR-17106) LBSolrClient: Make it configurable to remove zombie ping checks

Chris M. Hostetter (Jira) Tue, 16 Jan 2024 15:10:04 -0800


    [ 
https://issues.apache.org/jira/browse/SOLR-17106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17807491#comment-17807491
 ]


Chris M. Hostetter commented on SOLR-17106:
-------------------------------------------

Hi Aprna, thanks for creating this jira!

 

I read through the changes on your branch, and I'm a little confused by a few 
things...

 

1. What was your motivation for replacing {{setAliveCheckInterval(int 
aliveCheckInterval, TimeUnit unit)}} with 
{{setZombieStateMonitoringIntervalMillis(long 
zombieStateMonitoringIntervalMillis)}} ?

It appears that they ultimately serve the same purpose of dictating the 
{{scheduleAtFixedRate()}} executor, but changing the name just seems like an 
unnecessary back compat break?

(We also, AFAIK, don't currently use the term "zombie" in any of the {{public}} 
APIs of SolrJ – only in the internal tracking of what servers are "not alive" 
.... we probably want to keep the APIs in terms of being "alive" or "not alive")

Even if there is a strong reason I'm overlooking why it makes sense to change 
the method name: There's also been a push over the last few years to 
standardize on always taking in {{TimeUnit}} when dealing with method args that 
specify a time duration, rather then forcing the client to specify milliseconds 
(see SOLR-16595)

 

2. I don't think {{setMinZombieReleaseTimeMillis(long jailTimeInMillis)}} is 
being used the way you intended / expected?

>From the javadocs you added...
{quote}This param should be set if enableZombieChecks=false. This param 
configures the time a server should be regarded, at a minimum, as a zombie 
before being released
{quote}
...but from what I've seen of where & how the values are actually used:
 * In {{addZombie(...)}} it explicitly sets {{ServerWraper.remainingTime = 
zombieCheckIntervalMillis}}
 * In {{{}reduceRemainingZombieTime(...){}}}, if {{ServerWraper.remainingTime}} 
is non zero, explicitly set it to {{wrapper.remainingTime = Math.max(0, 
(wrapper.remainingTime - minZombieReleaseTimeMillis));}}

...that's it.

So no matter what the value {{minZombieReleaseTimeMillis}} is set to, the only 
thing that determines how long a URL is kept in the zombie list is 2x the 
{{zombieStateMonitoringIntervalMillis}} value used to configure the scheduled 
executor. The first time the executor calls {{reduceRemainingZombieTime(...)}} 
on a {{ServerWrapper}} it will reduce the {{remainingTime}} to zero. The Second 
time the executor calls {{reduceRemainingZombieTime(...)}} it will remove the 
URL from zombies and re-add it to alive.

I'm guessing what you ment to do is have {{reduceRemainingZombieTime(...)}} 
subtract {{zombieStateMonitoringIntervalMillis}} from {{remainingTime}} ? ... 
but this approach still seems kind of confusing & misleading, because tracking 
& recording "remaining milliseconds" like this implies more granularity then 
here really is.

{{remainingTime=10 (ms)}} is meaningless if 
{{zombieStateMonitoringIntervalMillis=60_000}} – you're going to have to wait 
the full 60 seconds.

I would suggest a "num iters" type configuration, rather then a "time" based 
configuration, so that we are very transparent that our "liveness checker" runs 
every {{aliveCheckInterval}} – and you can configure the "liveness checker" to 
only inspect a URL ever "N runs".

 
----
 

Thinking about the API of this change holistically, my inclination would be to 
support disabling the zombie check queries as a special case of making the 
specifics of those queries configurable.

 

Here's the API changes I would suggest:
 * Keep {{setAliveCheckInterval(...)}} the way it is
 * Add a new {{setAliveCheckSkipIters(int)}} that defaults to {{0}}
 ** Document that positive values can be used to ensure that the liveness 
checks for a given URL will only be run every "Nth" iteration of the 
{{setAliveCheckInterval}}
 * Add a new {{setAliveCheckQuery(SolrQuery)}} that defaults to the current 
static {{SolrQuery}}
 ** Document that if this is set to {{null}} the client will implicitly assume 
success anytime it would normally perform a a liveness check on a server.

The underlying functional changes would be:
 * Add an {{int skipAliveCheckIters}} to {{ServerWrapper}} that is initialized 
to whatever {{setAliveCheckSkipIters(int)}} is set to
 * In {{checkAZombieServer(...)}} ...
 ** Befor any existing logic: {{if (0 < zombieServer.skipAliveCheckIters)}} ... 
decrement and return immediately
 ** Short circuit the the current {{QueryRequest}} logic such that if 
{{setAliveCheckQuery(null)}} is in use, we don't execute any request, we just 
enter the "success" block

Unless I'm overlooking something, this would all be back-compatable, so we 
could merge it to the 9x branch, and then on the main branch we could change 
some of the defaults (like using {{setAliveCheckRequest(null)}} by default, 
increasing {{{}setAliveCheckSkipIters(int){}}}, decreasing 
{{{}setAliveCheckInterval(...){}}})

What do you think?

> LBSolrClient: Make it configurable to remove zombie ping checks
> ---------------------------------------------------------------
>
>                 Key: SOLR-17106
>                 URL: https://issues.apache.org/jira/browse/SOLR-17106
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Aparna Suresh
>            Priority: Minor
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Following discussion from a dev list discussion here: 
> [https://lists.apache.org/thread/f0zfmpg0t48xrtppyfsmfc5ltzsq2qqh]
> The issue involves scalability challenges in SolrJ's *LBSolrClient* when a 
> pod with numerous cores experiences connectivity problems. The "zombie" 
> tracking mechanism, operating on a core basis, becomes a bottleneck during 
> distributed search on a massive multi shard collection. Threads attempting to 
> reach unhealthy cores contribute to a high computational load, causing 
> performance issues. 
> As suggested by Chris Hostetter: LBSolrClient could be configured to disable 
> zombie "ping" checks, but retain zombie tracking. Once a replica/endpoint is 
> identified as a zombie, it could be held in zombie jail for X seconds, before 
> being released - hoping that by this timeframe ZK would be updated to mark 
> this endpoint DOWN or the pod is back up and CloudSolrClient would avoid 
> querying it. In any event, only 1 failed query would be needed to send the 
> server back to zombie jail.
>  
> There are benefits in doing this change:
>  * Eliminate the zombie ping requests, which would otherwise overload pod(s) 
> coming up after a restart
>  * Avoid memory leaks, in case a node/replica goes away permanently, but it 
> stays as zombie forever, with a background thread in LBSolrClient constantly 
> pinging it



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Commented] (SOLR-17106) LBSolrClient: Make it configurable to remove zombie ping checks

Reply via email to