: By relying only on ZK we’d lose the ability to react quickly to issues that
: might not even make it to a ZK state change.
:
: I prefer that we keep the current approach but fix the implementation
: (David’s node health check suggestion seems interesting, as would capping
: the number of pings sent to a given node over a certain duration) rather
: than change the approach completely and lose the ability to quickly isolate
: non functional entities.
:
: By the way, if LBSolrClient removes from its zombie list replicas marked
: down in ZK (it should) and we still see a large number of pings, it means
: these replicas are still ACTIVE in ZK.
: A Solr Client relying only on ZK state might flood the problematic node
: even more.

All fair points.

Which makes me wonder if the real problem is that zombie *tracking* is tightly coupled with zombie *checking*?

IIRC LBSolrClient is already smart enough that it is willing to try sending a request to (one or more) zombie URLs as a last resort if the other URLs don't work -- and if a zombie server successfully responds to a request, LBSolrClient removes that URL from "zombie jail" (independent of when/if the "zombie ping check" thread has checked that URL).

So maybe we keep all the "zombie tracking" around, but (make it configurable to) remove the zombie ping requests? Continue to put URLs in zombie jail for X seconds, but at the end of X seconds we just let them out of jail, w/o any "zombie check" queries, trusting that by the end of X seconds either the pod is back up, or ZK has been updated to mark it down and CloudSolrClient will stop asking us to query it. And even if neither of those things happens, only one (or a few concurrent) failed requests will be needed to put it back in jail (and LBSolrClient will retry them on other nodes).

Speaking of which: I haven't reviewed the code thoroughly, but is there currently a memory leak in LBSolrClient in a situation where a node/replica goes away permanently while CloudSolrClient is making a request via LBSolrClient with that URL? ... doesn't the current code add it to the zombie list and then just keep it there forever (with the background thread constantly pinging it)?

And just to clarify: I'm not making these strawman proposals because I object to adding node level zombie checks before replica level -- I'm just trying to think about the problem in a way that might *reduce* the total number of requests to Solr nodes that (may be) struggling before jumping to solutions that by definition increase them.

-Hoss
http://www.lucidworks.com/
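
For illustration only, a rough sketch of the "jail for X seconds, then release without a ping" idea described above. This is not LBSolrClient code: the class name `ZombieJail`, its methods, and the injected clock are all hypothetical, and real retry and concurrency handling would need more care. Because every entry carries a release time, a URL that disappears permanently eventually drops out of the map instead of staying there forever.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;

/** Hypothetical sketch: time-boxed zombie tracking with no background pings. */
final class ZombieJail {
    // URL -> clock value at which the URL is released from jail
    private final Map<String, Long> releaseAt = new ConcurrentHashMap<>();
    private final long jailDuration;
    private final LongSupplier clock; // injected so the behavior can be tested

    ZombieJail(long jailDuration, LongSupplier clock) {
        this.jailDuration = jailDuration;
        this.clock = clock;
    }

    /** Call when a request to this URL failed. */
    void jail(String url) {
        releaseAt.put(url, clock.getAsLong() + jailDuration);
    }

    /** Call when a request to this URL succeeded (even a last-resort one). */
    void markAlive(String url) {
        releaseAt.remove(url);
    }

    /** True while the URL is still serving its sentence; expired entries are dropped. */
    boolean isJailed(String url) {
        Long release = releaseAt.get(url);
        if (release == null) {
            return false;
        }
        if (clock.getAsLong() - release >= 0) {
            releaseAt.remove(url, release); // released without any ping
            return false;
        }
        return true;
    }

    /** Non-jailed URLs first, jailed URLs kept at the end as a last resort. */
    List<String> orderForRequest(List<String> urls) {
        List<String> ordered = new ArrayList<>();
        List<String> jailed = new ArrayList<>();
        for (String url : urls) {
            (isJailed(url) ? jailed : ordered).add(url);
        }
        ordered.addAll(jailed);
        return ordered;
    }

    /** Drops expired entries for URLs that are never asked about again. */
    void purgeExpired() {
        long now = clock.getAsLong();
        releaseAt.values().removeIf(release -> now - release >= 0);
    }

    int size() {
        return releaseAt.size();
    }
}
```

In this sketch a caller would invoke `jail(url)` on a failed request and `markAlive(url)` on a successful one, and would use `orderForRequest` to decide which URLs to try first. An occasional `purgeExpired()` call (or an equivalent lazy cleanup) is what keeps entries for permanently removed nodes from accumulating.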