With many cores per node, and a node having a network issue (or some other node-level issue preventing connectivity), I'm seeing that SolrJ's LBSolrClient "zombie" tracking doesn't really scale. It works on a per-core basis; it knows nothing about nodes. Essentially, a client (or a healthy node) tries to reach a core on the unhealthy node, the request fails, and that core is deemed a "zombie" (the term the code uses). Now imagine a distributed search over a massive many-shard collection, where hundreds of replicas on that node are being contacted at the same time, all failing. In exactly that circumstance, I've seen a thread dump showing a thousand threads (in state RUNNABLE, not blocked) all computing zombieServers.keySet().toString() (it's a ConcurrentHashMap) just to produce an exception message. Yikes! A quick fix would be to omit that list from the message entirely, perhaps whenever the number of zombies exceeds 10. I'll file a JIRA issue. Ideally the node would be deemed not live and wouldn't be contacted again until it returned, but in this case the distributed search fan-out was so massive, and thus nearly instantaneous, that there was no room for a more graceful transition.
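To make the quick fix concrete, here's a minimal sketch. This is not the actual LBSolrClient code; the field name, the throwing method, and the cutoff of 10 are illustrative. The idea is simply to avoid stringifying a large ConcurrentHashMap key set on every failing request:

import java.util.concurrent.ConcurrentHashMap;
import org.apache.solr.client.solrj.SolrServerException;

class ZombieMessageSketch {
  // Illustrative stand-in for LBSolrClient's zombie map (server URL -> wrapper).
  private final ConcurrentHashMap<String, Object> zombieServers = new ConcurrentHashMap<>();

  private static final int MAX_ZOMBIES_IN_MESSAGE = 10; // arbitrary cutoff

  void throwNoLiveServers() throws SolrServerException {
    // With hundreds or thousands of zombies, keySet().toString() on every
    // failing request burns CPU producing a message nobody can read anyway,
    // so only stringify the set when it's small.
    String detail =
        zombieServers.size() > MAX_ZOMBIES_IN_MESSAGE
            ? zombieServers.size() + " zombie servers (list omitted)"
            : zombieServers.keySet().toString();
    throw new SolrServerException(
        "No live SolrServers available to handle this request: " + detail);
  }
}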
Moreover, I wonder if others have found LBSolrClient problematic with so many cores per node? I'm seeing tons of zombie-checking queries going out when there's a node-wide problem. I think LBSolrClient ought to know about nodes and should try a node-level healthcheck ping before issuing any core-level requests. Maybe once a node's healthcheck has failed and then succeeded, and a small sample of the zombie cores on that node pass their pings, assume the rest will pass too (rather than pinging them all). Just a rough idea; see the sketch in the P.S. below.

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley
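P.S. A rough sketch of the node-gating idea, under some loudly labeled assumptions: NodeAwareZombieChecker, nodeRecovered, and SAMPLE_SIZE are all hypothetical and not part of SolrJ, and building a client per call is wasteful (a real implementation would reuse clients). HealthCheckRequest, SolrClient.ping(), and HttpSolrClient.Builder are real SolrJ APIs:

import java.io.IOException;
import java.util.List;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.HealthCheckRequest;

// Hypothetical helper: gate per-core zombie pings behind a node-level health
// check, then ping only a small sample of that node's zombie cores.
class NodeAwareZombieChecker {
  private static final int SAMPLE_SIZE = 3; // arbitrary sample size

  /** Returns true if the node's zombie cores should all be considered recovered. */
  boolean nodeRecovered(String nodeBaseUrl, List<String> zombieCoreUrls) {
    // 1. Node-level check first: one request instead of one per zombie core.
    try (SolrClient nodeClient = new HttpSolrClient.Builder(nodeBaseUrl).build()) {
      new HealthCheckRequest().process(nodeClient); // throws if node is unhealthy
    } catch (SolrServerException | IOException e) {
      return false; // node still down; skip all per-core pings
    }
    // 2. Node is up: ping a small sample of its zombie cores; if they all
    //    respond, assume the rest will too rather than pinging every one.
    List<String> sample =
        zombieCoreUrls.subList(0, Math.min(SAMPLE_SIZE, zombieCoreUrls.size()));
    for (String coreUrl : sample) {
      try (SolrClient coreClient = new HttpSolrClient.Builder(coreUrl).build()) {
        coreClient.ping();
      } catch (Exception e) {
        return false;
      }
    }
    return true;
  }
}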