[
https://issues.apache.org/jira/browse/SOLR-17106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17807491#comment-17807491
]
Chris M. Hostetter commented on SOLR-17106:
-------------------------------------------
Hi Aprna, thanks for creating this jira!
I read through the changes on your branch, and I'm a little confused by a few
things...
1. What was your motivation for replacing {{setAliveCheckInterval(int
aliveCheckInterval, TimeUnit unit)}} with
{{setZombieStateMonitoringIntervalMillis(long
zombieStateMonitoringIntervalMillis)}} ?
It appears that they ultimately serve the same purpose of dictating the
{{scheduleAtFixedRate()}} executor, but changing the name just seems like an
unnecessary back compat break?
(We also, AFAIK, don't currently use the term "zombie" in any of the {{public}}
APIs of SolrJ – only in the internal tracking of what servers are "not alive"
.... we probably want to keep the APIs in terms of being "alive" or "not alive")
Even if there is a strong reason I'm overlooking why it makes sense to change
the method name: There's also been a push over the last few years to
standardize on always taking in {{TimeUnit}} when dealing with method args that
specify a time duration, rather then forcing the client to specify milliseconds
(see SOLR-16595)
2. I don't think {{setMinZombieReleaseTimeMillis(long jailTimeInMillis)}} is
being used the way you intended / expected?
>From the javadocs you added...
{quote}This param should be set if enableZombieChecks=false. This param
configures the time a server should be regarded, at a minimum, as a zombie
before being released
{quote}
...but from what I've seen of where & how the values are actually used:
* In {{addZombie(...)}} it explicitly sets {{ServerWraper.remainingTime =
zombieCheckIntervalMillis}}
* In {{{}reduceRemainingZombieTime(...){}}}, if {{ServerWraper.remainingTime}}
is non zero, explicitly set it to {{wrapper.remainingTime = Math.max(0,
(wrapper.remainingTime - minZombieReleaseTimeMillis));}}
...that's it.
So no matter what the value {{minZombieReleaseTimeMillis}} is set to, the only
thing that determines how long a URL is kept in the zombie list is 2x the
{{zombieStateMonitoringIntervalMillis}} value used to configure the scheduled
executor. The first time the executor calls {{reduceRemainingZombieTime(...)}}
on a {{ServerWrapper}} it will reduce the {{remainingTime}} to zero. The Second
time the executor calls {{reduceRemainingZombieTime(...)}} it will remove the
URL from zombies and re-add it to alive.
I'm guessing what you ment to do is have {{reduceRemainingZombieTime(...)}}
subtract {{zombieStateMonitoringIntervalMillis}} from {{remainingTime}} ? ...
but this approach still seems kind of confusing & misleading, because tracking
& recording "remaining milliseconds" like this implies more granularity then
here really is.
{{remainingTime=10 (ms)}} is meaningless if
{{zombieStateMonitoringIntervalMillis=60_000}} – you're going to have to wait
the full 60 seconds.
I would suggest a "num iters" type configuration, rather then a "time" based
configuration, so that we are very transparent that our "liveness checker" runs
every {{aliveCheckInterval}} – and you can configure the "liveness checker" to
only inspect a URL ever "N runs".
----
Thinking about the API of this change holistically, my inclination would be to
support disabling the zombie check queries as a special case of making the
specifics of those queries configurable.
Here's the API changes I would suggest:
* Keep {{setAliveCheckInterval(...)}} the way it is
* Add a new {{setAliveCheckSkipIters(int)}} that defaults to {{0}}
** Document that positive values can be used to ensure that the liveness
checks for a given URL will only be run every "Nth" iteration of the
{{setAliveCheckInterval}}
* Add a new {{setAliveCheckQuery(SolrQuery)}} that defaults to the current
static {{SolrQuery}}
** Document that if this is set to {{null}} the client will implicitly assume
success anytime it would normally perform a a liveness check on a server.
The underlying functional changes would be:
* Add an {{int skipAliveCheckIters}} to {{ServerWrapper}} that is initialized
to whatever {{setAliveCheckSkipIters(int)}} is set to
* In {{checkAZombieServer(...)}} ...
** Befor any existing logic: {{if (0 < zombieServer.skipAliveCheckIters)}} ...
decrement and return immediately
** Short circuit the the current {{QueryRequest}} logic such that if
{{setAliveCheckQuery(null)}} is in use, we don't execute any request, we just
enter the "success" block
Unless I'm overlooking something, this would all be back-compatable, so we
could merge it to the 9x branch, and then on the main branch we could change
some of the defaults (like using {{setAliveCheckRequest(null)}} by default,
increasing {{{}setAliveCheckSkipIters(int){}}}, decreasing
{{{}setAliveCheckInterval(...){}}})
What do you think?
> LBSolrClient: Make it configurable to remove zombie ping checks
> ---------------------------------------------------------------
>
> Key: SOLR-17106
> URL: https://issues.apache.org/jira/browse/SOLR-17106
> Project: Solr
> Issue Type: Improvement
> Security Level: Public(Default Security Level. Issues are Public)
> Reporter: Aparna Suresh
> Priority: Minor
> Time Spent: 10m
> Remaining Estimate: 0h
>
> Following discussion from a dev list discussion here:
> [https://lists.apache.org/thread/f0zfmpg0t48xrtppyfsmfc5ltzsq2qqh]
> The issue involves scalability challenges in SolrJ's *LBSolrClient* when a
> pod with numerous cores experiences connectivity problems. The "zombie"
> tracking mechanism, operating on a core basis, becomes a bottleneck during
> distributed search on a massive multi shard collection. Threads attempting to
> reach unhealthy cores contribute to a high computational load, causing
> performance issues.
> As suggested by Chris Hostetter: LBSolrClient could be configured to disable
> zombie "ping" checks, but retain zombie tracking. Once a replica/endpoint is
> identified as a zombie, it could be held in zombie jail for X seconds, before
> being released - hoping that by this timeframe ZK would be updated to mark
> this endpoint DOWN or the pod is back up and CloudSolrClient would avoid
> querying it. In any event, only 1 failed query would be needed to send the
> server back to zombie jail.
>
> There are benefits in doing this change:
> * Eliminate the zombie ping requests, which would otherwise overload pod(s)
> coming up after a restart
> * Avoid memory leaks, in case a node/replica goes away permanently, but it
> stays as zombie forever, with a background thread in LBSolrClient constantly
> pinging it
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]