Hi Jan, I created a Jira issue and proposed a possible solution there:
https://issues.apache.org/jira/browse/SOLR-10284

Please feel free to comment if you have your own ideas. Thanks for the
response.
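The gist of what I proposed (just a rough sketch here, not the actual
patch; host names, ports, and paths below are placeholders): before
trusting a Zookeeper node, ask it what mode it is in, and refuse anything
that answers "standalone". You can approximate the same check from a
shell today with Zookeeper's standard four-letter-word commands:

    #!/bin/sh
    # Pre-flight check before starting Solr: ask every ensemble member
    # for its mode via the "srvr" four-letter-word command, whose output
    # contains a line like "Mode: leader|follower|standalone".
    for host in zoo1 zoo2 zoo3; do    # placeholder host names
        mode=$(echo srvr | nc "$host" 2181 | awk '/^Mode:/ {print $2}')
        if [ "$mode" = "standalone" ]; then
            echo "Refusing to start: $host is in standalone mode" >&2
            exit 1
        fi
    done
    exec bin/solr start -c -z zoo1:2181,zoo2:2181,zoo3:2181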
On Mon, Mar 13, 2017 at 5:52 PM, Jan Høydahl <[email protected]> wrote:

> Hi
>
> Thanks for reporting. As it may take some time before we get ZK 3.5.x
> out there, a fix would be nice. Do you plan to make our zkClient somehow
> explicitly validate that all given zk nodes are “good”?
>
> Or is there some way we could fix this with documentation? I imagine, if
> we always propose to use a chroot, e.g. ZK_HOST=zoo1,zoo2,zoo3/solr,
> then it would be a requirement to do a mkroot before being able to use
> ZK. And I assume that in that case, if one of the ZK nodes got restarted
> with a missing or wrong configuration, it would start up with some other
> data folder(?) and refuse to serve any data whatsoever, since the /solr
> root would not exist?
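For reference, creating such a chroot is a one-time step before Solr can
use it. A sketch with placeholder hosts, using Zookeeper's own CLI (newer
Solr versions also ship a "bin/solr zk mkroot" command, if I recall
correctly):

    # Create the /solr chroot once, before pointing Solr at it:
    zkCli.sh -server zoo1:2181 create /solr ""

    # Then in solr.in.sh:
    ZK_HOST=zoo1:2181,zoo2:2181,zoo3:2181/solr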
> I’d say, even if this is not a Solr bug per se, it is still worthy of a
> JIRA issue.
>
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
>
> On 14 Mar 2017, at 00:11, Ben DeMott <[email protected]> wrote:
>
> So I wanted to throw this out there and get any feedback.
>
> We had a persistent issue with our Solr clusters doing crazy things,
> from running out of file descriptors, to replication issues, to filling
> up /overseer/queue. Just some of the log exceptions:
>
> o.e.j.s.ServerConnector java.io.IOException: Too many open files
>
> o.a.s.s.HttpSolrCall null:org.apache.solr.common.SolrException: Error
> trying to proxy request for url:
> http://10.50.64.4:8983/solr/efc-jobsearch-col/select
>
> o.a.s.h.RequestHandlerBase org.apache.solr.common.SolrException:
> ClusterState says we are the leader
> (http://10.50.64.4:8983/solr/efc-jobsearch-col_shard1_replica2), but
> locally we don't think so. Request came from null
>
> o.a.s.c.Overseer Bad version writing to ZK using compare-and-set, will
> force refresh cluster state: KeeperErrorCode = BadVersion for
> /collections/efc-jobsearch-col/state.json
>
> IndexFetcher File _5oz.nvd did not match. expected checksum is
> 3661731988 and actual is 840593658. expected length is 271091 and actual
> length is 271091
>
> ...
>
> I'll get to the point quickly. This was all caused by the Zookeeper
> configuration on a particular node getting reset for a period of
> seconds, and the service being restarted automatically. When this
> happened, Solr's connection to Zookeeper would be reset, and Solr would
> reconnect to the Zookeeper node, which now had a blank configuration and
> was running in "STANDALONE" mode. The changes to ZK registered by that
> Solr connection would never be registered with the rest of the cluster.
>
> As a result, the cversion of /live_nodes would be ahead of the other
> servers by a version or two, but the zxids would all be in sync. The
> nodes would never re-synchronize; as far as Zookeeper is concerned,
> everything is synced up properly. Also, /live_nodes would be a
> mismatched mess: empty or inconsistent depending on where Solr's ZK
> connections were pointed, resulting in client connections returning
> some, wrong, or no "live nodes".
>
> Now, the Zookeeper documentation specifically tells you never to connect
> to an inconsistent group of servers, as it will play havoc with
> Zookeeper, and it did exactly this. It caused absolute havoc within our
> cluster.
>
> As of Zookeeper 3.5 there is an option to NEVER allow it to run in
> standalone, which we will be using when a stable version is released.
>
> So to summarize: if a Zookeeper ensemble host ever goes into standalone,
> even temporarily, Solr will be disconnected, may then reconnect
> (depending on which ZK node it picks), and its updates will never be
> synchronized. It also won't be able to coordinate any of its Cloud
> operations.
>
> So, in the interest of being a good internet citizen, I'm writing this
> up: is there any desire for a patch that would provide a configuration
> or JVM option to refuse to connect to nodes in standalone operation?
> Obviously the built-in ZK server that comes with Solr runs in standalone
> mode, so this would only be an option in solr.in.sh. But it would
> prevent Solr from bringing the entire cluster down in the event a single
> ZK server was temporarily misconfigured, or lost its configuration for
> some reason.
>
> Maybe this isn't worth addressing. Thoughts?
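P.S. The Zookeeper 3.5 option referred to above is, as far as I can tell
from the 3.5 docs, the standaloneEnabled property. Roughly what we plan
to run once a stable 3.5 is out (a sketch; placeholder hosts):

    # zoo.cfg on every ensemble member (Zookeeper 3.5+ only).
    # With standaloneEnabled=false the server always runs the quorum
    # protocol and will not fall back to standalone mode, even if it
    # comes up seeing a single-server configuration.
    standaloneEnabled=false
    server.1=zoo1:2888:3888
    server.2=zoo2:2888:3888
    server.3=zoo3:2888:3888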
