I believe ZK state should be used to decide that an entity is
down/unreachable, but since updates to entity state are slow (session
expiration has to happen first, then the watches have to notify), it
cannot IMO be the only parameter.

The proof of the API is in the call (pudding analogy).

The existing approach seems sound to me: ZK provides a set of targets to
try, but a destination is considered temporarily unreachable if it doesn't
do the job, regardless of what ZK says.
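
To make this concrete, here is a rough sketch of the pattern I mean. It is
plain Java with names of my own, not LBSolrClient's actual internals: ZK
supplies the candidate URLs, and a target that fails a request is marked
unreachable locally for a short cool-off, whatever ZK says:

import java.time.Duration;
import java.time.Instant;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class FailoverSketch {
  /** Stand-in for whatever actually executes the request. */
  public interface RequestSender {
    String send(String url) throws Exception;
  }

  private final Map<String, Instant> unreachableUntil = new ConcurrentHashMap<>();
  private final Duration coolOff = Duration.ofSeconds(30); // illustrative value

  public String request(List<String> zkProvidedUrls, RequestSender sender) throws Exception {
    Exception last = null;
    for (String url : zkProvidedUrls) {
      Instant until = unreachableUntil.get(url);
      if (until != null && Instant.now().isBefore(until)) {
        continue; // locally marked unreachable: skip it, whatever ZK says
      }
      try {
        String rsp = sender.send(url);
        unreachableUntil.remove(url); // success clears the local marking
        return rsp;
      } catch (Exception e) {
        unreachableUntil.put(url, Instant.now().plus(coolOff)); // back off for a while
        last = e;
      }
    }
    throw last != null ? last : new IllegalStateException("no reachable target");
  }
}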

By relying only on ZK we’d lose the ability to react quickly to issues that
might not even make it to a ZK state change.

I'd rather keep the current approach but fix the implementation (David's
node health check suggestion seems interesting, as would capping the
number of pings sent to a given node over a certain duration; a sketch
follows below) than change the approach completely and lose the ability to
quickly isolate non-functional entities.
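
For the capping idea, even something as simple as a per-node ping budget
would help. Again a sketch: the class and the numbers are hypothetical,
not anything LBSolrClient has today:

import java.time.Duration;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class PingBudget {
  private final int maxPingsPerWindow;
  private final long windowNanos;
  private final Map<String, Window> byNode = new ConcurrentHashMap<>();

  public PingBudget(int maxPingsPerWindow, Duration window) {
    this.maxPingsPerWindow = maxPingsPerWindow;
    this.windowNanos = window.toNanos();
  }

  /** True if a zombie health-check ping to this node is still within budget. */
  public boolean tryAcquire(String nodeName) {
    Window w = byNode.computeIfAbsent(nodeName, n -> new Window(System.nanoTime()));
    synchronized (w) {
      long now = System.nanoTime();
      if (now - w.start >= windowNanos) { // window elapsed, start a new one
        w.start = now;
        w.count = 0;
      }
      if (w.count >= maxPingsPerWindow) {
        return false; // over budget: skip the ping, keep the entry zombie for now
      }
      w.count++;
      return true;
    }
  }

  private static final class Window {
    long start;
    int count;
    Window(long start) { this.start = start; }
  }
}

Something like new PingBudget(3, Duration.ofSeconds(10)) would allow at
most three zombie pings per node every ten seconds, however many zombie
cores that node hosts.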

By the way, if LBSolrClient removes replicas marked DOWN in ZK from its
zombie list (as it should) and we still see a large number of pings, it
means those replicas are still ACTIVE in ZK. A Solr client relying only on
ZK state would then flood the problematic node even more.
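
What I have in mind for that pruning is roughly the following; the types
are stand-ins, not LBSolrClient's real data structures:

import java.util.Map;
import java.util.Set;

public class ZombiePruner {
  public enum ReplicaState { ACTIVE, DOWN, RECOVERING }

  /**
   * Drop zombies that ZK already reports as DOWN: there is no point pinging
   * them, the ZK watches will tell us when they come back. Whatever remains
   * in the zombie list while ACTIVE in ZK is exactly the case where a client
   * relying only on ZK state would keep sending full requests to a broken
   * node.
   */
  public void prune(Set<String> zombieUrls, Map<String, ReplicaState> zkStateByUrl) {
    zombieUrls.removeIf(url -> zkStateByUrl.get(url) == ReplicaState.DOWN);
  }
}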

Ilan

On Mon 20 Nov 2023 at 16:07, David Smiley <dsmi...@apache.org> wrote:

> That's a really fine idea Hoss!
>
> After reviewing LBSolrClient again, I think your proposal would best be a
> new SolrClient subclass.  LBSolrClient has a fair amount of state tracking
> but a failover-only client would track no state.  Perhaps LBSolrClient
> might subclass it or not.
>
> After some discussion with my colleagues, we might try an experiment that
> attempts this and see how it goes.  I suspect it'll be a net positive.  It
> takes some production bake time to really get confidence in something of
> this nature.
>
> ~ David Smiley
> Apache Lucene/Solr Search Developer
> http://www.linkedin.com/in/davidwsmiley
>
>
> On Thu, Nov 16, 2023 at 1:55 PM Chris Hostetter <hossman_luc...@fucit.org>
> wrote:
>
> >
> > I think it's worth remembering that LBSolrClient, and its design,
> > pre-dates SolrCloud and all of the ZK plumbing we have to know when nodes
> > & replicas are "live" ... it was written at a time when people had to
> > manually specify the list of solr servers and cores themselves when
> > sending requests.
> >
> > Then when SolrCloud was added, the "zk aware" CloudSolrClient logic was
> > wrapped AROUND LBSolrClient -- CloudSolrClient already has some idea
> > what nodes & replicas are "live" when it sends the request, but
> > LBSolrClient doesn't so...
> >
> > : out when there's a wide problem.  I think that LBSolrClient ought to
> > : know about the nodes and should try a node level healthcheck ping
> > : before executing any core level requests.  Maybe if the healthcheck
> > : failed then succeeded, and if all of a small sample of zombie cores
> > : there pass, assume they will all pass (don't send pings to all).
> > : Just a rough idea.
> >
> > ...i think it's worth considering an inverse idea: make it configurable
> > (and probably change the default given the common usecase is SolrCloud)
> > to build a LBSolrClient that does *NO* zombie tracking at all -- it
> > just continues to use the multiple URL options it's given for each
> > request to retry on (certain types of failures).
> >
> > Leave the "live" node/replica tracking to the CloudSolrClient layer, and
> > if there are code paths where it's possible CloudSolrClient is passing
> > stale lists of replica URLs to LBSolrClient that it (should) already know
> > are not alive (via zk watchers), let's treat those as bugs in
> > CloudSolrClient and fix them.
> >
> >
> >
> > -Hoss
> > http://www.lucidworks.com/
> >
>
