So I wanted to throw this out there and get any feedback. We had a persistent issue with our Solr clusters doing crazy things: running out of file descriptors, having replication issues, filling up /overseer/queue .... A few of the logged exceptions:
    o.e.j.s.ServerConnector java.io.IOException: Too many open files
    o.a.s.s.HttpSolrCall null:org.apache.solr.common.SolrException: Error trying to proxy request for url: http://10.50.64.4:8983/solr/efc-jobsearch-col/select
    o.a.s.h.RequestHandlerBase org.apache.solr.common.SolrException: ClusterState says we are the leader (http://10.50.64.4:8983/solr/efc-jobsearch-col_shard1_replica2), but locally we don't think so. Request came from null
    o.a.s.c.Overseer Bad version writing to ZK using compare-and-set, will force refresh cluster state: KeeperErrorCode = BadVersion for /collections/efc-jobsearch-col/state.json
    IndexFetcher File _5oz.nvd did not match. expected checksum is 3661731988 and actual is checksum 840593658. expected length is 271091 and actual length is 271091
    ...

I'll get to the point quickly. This was all caused by the Zookeeper configuration on a particular node getting reset for a period of seconds, with the service being restarted automatically. When this happened, Solr's connection to Zookeeper would be reset, and Solr would reconnect to that Zookeeper node, which now had a blank configuration and was running in "standalone" mode. The changes that Solr's connection registered in ZK were never registered with the rest of the ensemble. As a result, the cversion of /live_nodes on that node would be ahead of the other servers by a version or two, but the zxids would all be in sync, so the nodes would never re-synchronize; as far as Zookeeper is concerned, everything is synced up properly. /live_nodes itself would be a mismatched mess: empty or inconsistent, depending on where Solr's ZK connections were pointed, so client connections would return some, wrong, or no "live nodes". (A rough sketch of how to spot this divergence from Java is appended below.)

The Zookeeper documentation specifically tells you never to connect to an inconsistent group of servers, as it will play havoc with Zookeeper, and it did exactly that: it caused absolute havoc within our cluster. As of Zookeeper 3.5 there is an option to never allow a server to run in standalone mode (standaloneEnabled=false in zoo.cfg, if I'm reading the docs right), which we will be using once a stable 3.5 release is out.

So to summarize: if a Zookeeper ensemble host ever goes into standalone mode, even temporarily, Solr will be disconnected, may then reconnect (depending on which ZK node it picks), its updates will never be synchronized, and it won't be able to coordinate any of its Cloud operations.

So in the interest of being a good internet citizen I'm writing this up. Is there any desire for a patch that would provide a configuration or JVM option making Solr refuse to connect to ZK nodes running in standalone mode? (A rough sketch of such a check is appended at the very end.) Obviously the built-in ZK server that comes with Solr runs in standalone mode, so this would only be an option for solr.in.sh.... But it would prevent Solr from bringing the entire cluster down in the event a single ZK server was temporarily misconfigured or lost its configuration for some reason. Maybe this isn't worth addressing. Thoughts?
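P.S. To make the /live_nodes divergence concrete, here is a rough, untested sketch of how one could probe each ensemble member individually from Java and compare the cversion of /live_nodes. It assumes no chroot on the zkHost string (with a chroot the path would be something like /solr/live_nodes). A member that drifted through a standalone episode shows a cversion ahead of its peers even though the zxids agree:

    import org.apache.zookeeper.ZooKeeper;
    import org.apache.zookeeper.data.Stat;

    // Connect to one ensemble member at a time (NOT the full connection
    // string), so each server answers from its own view of the tree.
    public class LiveNodesCversionCheck {
        public static void main(String[] args) throws Exception {
            for (String hostPort : args) {   // e.g. zk1:2181 zk2:2181 zk3:2181
                ZooKeeper zk = new ZooKeeper(hostPort, 15000, event -> { });
                try {
                    // Assumes no chroot; adjust the path if you use one.
                    Stat stat = zk.exists("/live_nodes", false);
                    if (stat == null) {
                        System.out.printf("%s  /live_nodes missing%n", hostPort);
                    } else {
                        System.out.printf("%s  cversion=%d  children=%d%n",
                                hostPort, stat.getCversion(), stat.getNumChildren());
                    }
                } finally {
                    zk.close();
                }
            }
        }
    }

If the cversions disagree while everything else looks healthy, you're in the state described above.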

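And here is a rough sketch of the kind of pre-flight check I'm proposing: probe every host in the zkHost string with Zookeeper's "srvr" four-letter command and refuse to start if any of them reports "Mode: standalone". The command is real, but the class and wiring below are hypothetical, not existing Solr code (also note that newer ZK releases may require whitelisting four-letter commands via 4lw.commands.whitelist):

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.net.InetSocketAddress;
    import java.net.Socket;
    import java.nio.charset.StandardCharsets;

    public class ZkStandaloneGuard {

        // Send the "srvr" four-letter command and scan the reply for the
        // "Mode:" line; standalone servers report "Mode: standalone".
        static boolean isStandalone(String host, int port) throws IOException {
            try (Socket sock = new Socket()) {
                sock.connect(new InetSocketAddress(host, port), 5000);
                sock.getOutputStream().write("srvr".getBytes(StandardCharsets.US_ASCII));
                sock.getOutputStream().flush();
                try (BufferedReader in = new BufferedReader(
                        new InputStreamReader(sock.getInputStream(), StandardCharsets.US_ASCII))) {
                    String line;
                    while ((line = in.readLine()) != null) {
                        if (line.startsWith("Mode:")) {
                            return line.contains("standalone");
                        }
                    }
                }
            }
            return false;  // no Mode line seen; don't block startup on that
        }

        public static void main(String[] args) throws IOException {
            // args[0] is a zkHost-style string, e.g. "zk1:2181,zk2:2181,zk3:2181"
            for (String hostPort : args[0].split(",")) {
                String[] hp = hostPort.split(":");
                if (isStandalone(hp[0], Integer.parseInt(hp[1]))) {
                    System.err.println("Refusing to connect: " + hostPort
                            + " reports Mode: standalone");
                    System.exit(1);
                }
            }
            System.out.println("All ZK hosts report quorum mode; safe to connect.");
        }
    }

Ideally the same check would also run whenever the ZK session is re-established, not just at startup, since the failure mode here was a mid-flight reconnect to a freshly misconfigured node.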