So I wanted to throw this out there and get any feedback. We had a persistent issue with our Solr clusters doing crazy things: running out of file descriptors, having replication issues, filling up /overseer/queue .... A few of the logged exceptions:
    o.e.j.s.ServerConnector java.io.IOException: Too many open files
    o.a.s.s.HttpSolrCall null:org.apache.solr.common.SolrException: Error trying to proxy request for url: http://10.50.64.4:8983/solr/efc-jobsearch-col/select
    o.a.s.h.RequestHandlerBase org.apache.solr.common.SolrException: ClusterState says we are the leader (http://10.50.64.4:8983/solr/efc-jobsearch-col_shard1_replica2), but locally we don't think so. Request came from null
    o.a.s.c.Overseer Bad version writing to ZK using compare-and-set, will force refresh cluster state: KeeperErrorCode = BadVersion for /collections/efc-jobsearch-col/state.json
    IndexFetcher File _5oz.nvd did not match. expected checksum is 3661731988 and actual is checksum 840593658. expected length is 271091 and actual length is 271091
    ...

I'll get to the point quickly. This was all caused by the Zookeeper configuration on a particular node getting reset for a period of seconds, with the service being restarted automatically. When this happened, Solr's connection to Zookeeper would be reset, and Solr would reconnect to that Zookeeper node, which now had a blank configuration and was running in "standalone" mode. The changes that Solr's connection registered in ZK were never registered with the rest of the ensemble. As a result, the cversion of /live_nodes on that node would be ahead of the other servers by a version or two, but the zxids would all be in sync, so the nodes would never re-synchronize; as far as Zookeeper is concerned, everything is synced up properly. /live_nodes itself would be a mismatched mess: empty or inconsistent, depending on where Solr's ZK connections were pointed, so client connections would return some, wrong, or no "live nodes". (A rough sketch of how to spot this divergence from Java is appended below.)

The Zookeeper documentation specifically tells you never to connect to an inconsistent group of servers, as it will play havoc with Zookeeper, and it did exactly that: it caused absolute havoc within our cluster. As of Zookeeper 3.5 there is an option to never allow a server to run in standalone mode (standaloneEnabled=false in zoo.cfg, if I'm reading the docs right), which we will be using once a stable 3.5 release is out.

So to summarize: if a Zookeeper ensemble host ever goes into standalone mode, even temporarily, Solr will be disconnected, may then reconnect (depending on which ZK node it picks), its updates will never be synchronized, and it won't be able to coordinate any of its Cloud operations.

So in the interest of being a good internet citizen I'm writing this up. Is there any desire for a patch that would provide a configuration or JVM option making Solr refuse to connect to ZK nodes running in standalone mode? (A rough sketch of such a check is appended at the very end.) Obviously the built-in ZK server that comes with Solr runs in standalone mode, so this would only be an option for solr.in.sh.... But it would prevent Solr from bringing the entire cluster down in the event a single ZK server was temporarily misconfigured or lost its configuration for some reason. Maybe this isn't worth addressing. Thoughts?
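P.S. To make the /live_nodes divergence concrete, here is a rough, untested sketch of how one could probe each ensemble member individually from Java and compare the cversion of /live_nodes. It assumes no chroot on the zkHost string (with a chroot the path would be something like /solr/live_nodes). A member that drifted through a standalone episode shows a cversion ahead of its peers even though the zxids agree:

    import org.apache.zookeeper.ZooKeeper;
    import org.apache.zookeeper.data.Stat;

    // Connect to one ensemble member at a time (NOT the full connection
    // string), so each server answers from its own view of the tree.
    public class LiveNodesCversionCheck {
        public static void main(String[] args) throws Exception {
            for (String hostPort : args) {   // e.g. zk1:2181 zk2:2181 zk3:2181
                ZooKeeper zk = new ZooKeeper(hostPort, 15000, event -> { });
                try {
                    // Assumes no chroot; adjust the path if you use one.
                    Stat stat = zk.exists("/live_nodes", false);
                    if (stat == null) {
                        System.out.printf("%s  /live_nodes missing%n", hostPort);
                    } else {
                        System.out.printf("%s  cversion=%d  children=%d%n",
                                hostPort, stat.getCversion(), stat.getNumChildren());
                    }
                } finally {
                    zk.close();
                }
            }
        }
    }

If the cversions disagree while everything else looks healthy, you're in the state described above.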

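And here is a rough sketch of the kind of pre-flight check I'm proposing: probe every host in the zkHost string with Zookeeper's "srvr" four-letter command and refuse to start if any of them reports "Mode: standalone". The command is real, but the class and wiring below are hypothetical, not existing Solr code (also note that newer ZK releases may require whitelisting four-letter commands via 4lw.commands.whitelist):

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.net.InetSocketAddress;
    import java.net.Socket;
    import java.nio.charset.StandardCharsets;

    public class ZkStandaloneGuard {

        // Send the "srvr" four-letter command and scan the reply for the
        // "Mode:" line; standalone servers report "Mode: standalone".
        static boolean isStandalone(String host, int port) throws IOException {
            try (Socket sock = new Socket()) {
                sock.connect(new InetSocketAddress(host, port), 5000);
                sock.getOutputStream().write("srvr".getBytes(StandardCharsets.US_ASCII));
                sock.getOutputStream().flush();
                try (BufferedReader in = new BufferedReader(
                        new InputStreamReader(sock.getInputStream(), StandardCharsets.US_ASCII))) {
                    String line;
                    while ((line = in.readLine()) != null) {
                        if (line.startsWith("Mode:")) {
                            return line.contains("standalone");
                        }
                    }
                }
            }
            return false;  // no Mode line seen; don't block startup on that
        }

        public static void main(String[] args) throws IOException {
            // args[0] is a zkHost-style string, e.g. "zk1:2181,zk2:2181,zk3:2181"
            for (String hostPort : args[0].split(",")) {
                String[] hp = hostPort.split(":");
                if (isStandalone(hp[0], Integer.parseInt(hp[1]))) {
                    System.err.println("Refusing to connect: " + hostPort
                            + " reports Mode: standalone");
                    System.exit(1);
                }
            }
            System.out.println("All ZK hosts report quorum mode; safe to connect.");
        }
    }

Ideally the same check would also run whenever the ZK session is re-established, not just at startup, since the failure mode here was a mid-flight reconnect to a freshly misconfigured node.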