Re: LBSolrClient and "zombie" check at core level vs node level

Chris Hostetter Tue, 21 Nov 2023 10:50:34 -0800

: By relying only on ZK we’d lose the ability to react quickly to issues that
: might not even make it to a ZK state change.
: 
: I prefer that we keep the current approach but fix the implementation
: (David’s node health check suggestion seems interesting, as would capping
: the number of pings sent to a given node over a certain duration) rather
: than change the approach completely and lose the ability to quickly isolate
: non functional entities.
: 
: By the way, if LBSolrClient removes from its zombie list replicas marked
: down in ZK (it should) and we still see a large number of pings, it means
: these replicas are still ACTIVE in ZK.
: A Solr Client relying only on ZK state might flood the problematic node
: even more.


All fair points.

Which makes me wonder if the real problem is that zombie *tracking* is 
tightly coupled with zombie *checking*?

IIRC LBSolrClient is already smart enough that it is willing to try to 
send a request to (one or more) zombie URLs as a last resort if the 
other URLs don't work -- and if a zombie server successfully responds 
to a request, LBSolrClient removes that URL from "zombie jail" 
(independent of when/if the "zombie ping check" thread has checked that 
URL).

So maybe we keep all the "zombie tracking" around, but (make it 
configurable to) remove the zombie ping requests?   

Continue to put URLs in zombie jail for X seconds, but at the end of X 
seconds we just let them out of jail, w/o any "zombie check" queries.  
Trusting that by the end of X seconds either the pod is back up, or ZK has 
been updated to mark it down and CloudSolrClient will stop asking us to 
query it.  And even if neither of those things happens, only one (or a few 
concurrent) failed requests will be needed to put it back in jail (and 
LBSolrClient will retry them on other nodes).
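
To make the "timed jail, no pings" idea concrete, here is a rough sketch. 
It is NOT the existing LBSolrClient code: the class name TimedZombieJail, 
its methods, and the injected clock are all hypothetical, and the URLs in 
any usage are placeholders. It only shows the bookkeeping: a failure 
starts (or restarts) a sentence of X ms, a success releases the URL 
immediately, expiry releases it without any check query, and jailed URLs 
are still offered as a last resort.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;

/** Hypothetical sketch of ping-free zombie tracking; not part of SolrJ. */
class TimedZombieJail {
  // URL -> time (ms) at which the URL is released from jail.
  private final Map<String, Long> releaseAtMs = new ConcurrentHashMap<>();
  private final long jailMs;
  private final LongSupplier clockMs; // injected so tests can control time

  TimedZombieJail(long jailMs, LongSupplier clockMs) {
    this.jailMs = jailMs;
    this.clockMs = clockMs;
  }

  /** A request to this URL failed: jail it, or restart its sentence. */
  void onFailure(String url) {
    releaseAtMs.put(url, clockMs.getAsLong() + jailMs);
  }

  /** A request succeeded (e.g. a last-resort attempt): release at once. */
  void onSuccess(String url) {
    releaseAtMs.remove(url);
  }

  /** True while the sentence is running. Expired entries are dropped; no ping is sent. */
  boolean isJailed(String url) {
    Long releaseAt = releaseAtMs.get(url);
    if (releaseAt == null) {
      return false;
    }
    if (clockMs.getAsLong() >= releaseAt) {
      // Remove only if the URL was not re-jailed concurrently.
      releaseAtMs.remove(url, releaseAt);
      return false;
    }
    return true;
  }

  /** Drops every expired entry, so URLs that are never asked for again do not pile up. */
  void purgeExpired() {
    long now = clockMs.getAsLong();
    releaseAtMs.values().removeIf(releaseAt -> now >= releaseAt);
  }

  /** Non-jailed URLs first (in the given order), jailed URLs appended as a last resort. */
  List<String> orderForRequest(List<String> candidates) {
    purgeExpired();
    List<String> live = new ArrayList<>();
    List<String> jailed = new ArrayList<>();
    for (String url : candidates) {
      (isJailed(url) ? jailed : live).add(url);
    }
    live.addAll(jailed);
    return live;
  }

  /** Number of URLs currently tracked (including any not yet purged). */
  int size() {
    return releaseAtMs.size();
  }
}
```

Because entries simply expire, nothing in this sketch needs a background 
thread, and the map cannot hold a URL longer than X ms past its last 
failure once purgeExpired() runs.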


Speaking of which: I haven't reviewed the code thoroughly, but is there 
currently a memory leak in LBSolrClient in a situation where a 
node/replica goes away permanently as CloudSolrClient is making a request 
via LBSolrClient with that URL? ... doesn't the current code add it to 
the zombie list and then just keep it there forever? (with the background 
thread constantly pinging it?)


And just to clarify: I'm not making these strawman proposals because I 
object to adding node level zombie checks before replica level -- I'm just 
trying to think about the problem in a way that might *reduce* the total 
number of requests to solr nodes that (may be) struggling before jumping 
in to solutions that by definition increase them.





-Hoss
http://www.lucidworks.com/