With many cores per node, and a node having a network issue (or some other node-level issue preventing connectivity), I'm seeing that SolrJ's LBSolrClient "zombie" tracking doesn't really scale. It works on a per-core basis; it knows nothing about nodes. Essentially, a client (or a healthy node) tries to reach a core on the unhealthy node, the request fails, and that core is deemed a "zombie" (the term the code uses). Now imagine a distributed search over a massive many-shard collection, where hundreds of replicas on that node are being contacted at the same time, all failing. In exactly that circumstance, I've seen a thread dump showing a thousand threads (in state RUNNABLE, not blocked) all computing zombieServers.keySet().toString() (it's a ConcurrentHashMap) just to produce an exception message. Yikes! A quick fix would be to omit that list from the message entirely, perhaps whenever the number of zombies exceeds 10. I'll file a JIRA issue. Ideally the node would be deemed not live and wouldn't be contacted again until it returned, but in this case the distributed search fan-out was so massive, and thus nearly instantaneous, that there was no room for a more graceful transition.
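To make the quick fix concrete, here's a minimal sketch. This is not the actual LBSolrClient code; the field name, the throwing method, and the cutoff of 10 are illustrative. The idea is simply to avoid stringifying a large ConcurrentHashMap key set on every failing request:

import java.util.concurrent.ConcurrentHashMap;
import org.apache.solr.client.solrj.SolrServerException;

class ZombieMessageSketch {
  // Illustrative stand-in for LBSolrClient's zombie map (server URL -> wrapper).
  private final ConcurrentHashMap<String, Object> zombieServers = new ConcurrentHashMap<>();

  private static final int MAX_ZOMBIES_IN_MESSAGE = 10; // arbitrary cutoff

  void throwNoLiveServers() throws SolrServerException {
    // With hundreds or thousands of zombies, keySet().toString() on every
    // failing request burns CPU producing a message nobody can read anyway,
    // so only stringify the set when it's small.
    String detail =
        zombieServers.size() > MAX_ZOMBIES_IN_MESSAGE
            ? zombieServers.size() + " zombie servers (list omitted)"
            : zombieServers.keySet().toString();
    throw new SolrServerException(
        "No live SolrServers available to handle this request: " + detail);
  }
}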
Moreover, I wonder if others have found LBSolrClient problematic with so many cores per node? I'm seeing tons of zombie-checking queries going out when there's a node-wide problem. I think LBSolrClient ought to know about nodes and should try a node-level healthcheck ping before issuing any core-level requests. Maybe once a node's healthcheck has failed and then succeeded, and a small sample of the zombie cores on that node pass their pings, assume the rest will pass too (rather than pinging them all). Just a rough idea; see the sketch in the P.S. below.

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley
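P.S. A rough sketch of the node-gating idea, under some loudly labeled assumptions: NodeAwareZombieChecker, nodeRecovered, and SAMPLE_SIZE are all hypothetical and not part of SolrJ, and building a client per call is wasteful (a real implementation would reuse clients). HealthCheckRequest, SolrClient.ping(), and HttpSolrClient.Builder are real SolrJ APIs:

import java.io.IOException;
import java.util.List;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.HealthCheckRequest;

// Hypothetical helper: gate per-core zombie pings behind a node-level health
// check, then ping only a small sample of that node's zombie cores.
class NodeAwareZombieChecker {
  private static final int SAMPLE_SIZE = 3; // arbitrary sample size

  /** Returns true if the node's zombie cores should all be considered recovered. */
  boolean nodeRecovered(String nodeBaseUrl, List<String> zombieCoreUrls) {
    // 1. Node-level check first: one request instead of one per zombie core.
    try (SolrClient nodeClient = new HttpSolrClient.Builder(nodeBaseUrl).build()) {
      new HealthCheckRequest().process(nodeClient); // throws if node is unhealthy
    } catch (SolrServerException | IOException e) {
      return false; // node still down; skip all per-core pings
    }
    // 2. Node is up: ping a small sample of its zombie cores; if they all
    //    respond, assume the rest will too rather than pinging every one.
    List<String> sample =
        zombieCoreUrls.subList(0, Math.min(SAMPLE_SIZE, zombieCoreUrls.size()));
    for (String coreUrl : sample) {
      try (SolrClient coreClient = new HttpSolrClient.Builder(coreUrl).build()) {
        coreClient.ping();
      } catch (Exception e) {
        return false;
      }
    }
    return true;
  }
}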