Hi Jan, I created a Jira issue and proposed a possible solution there:
https://issues.apache.org/jira/browse/SOLR-10284

Please feel free to comment if you have your own ideas. Thanks for the
response.
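The gist of what I proposed (just a rough sketch here, not the actual
patch; host names, ports, and paths below are placeholders): before
trusting a Zookeeper node, ask it what mode it is in, and refuse anything
that answers "standalone". You can approximate the same check from a
shell today with Zookeeper's standard four-letter-word commands:

    #!/bin/sh
    # Pre-flight check before starting Solr: ask every ensemble member
    # for its mode via the "srvr" four-letter-word command, whose output
    # contains a line like "Mode: leader|follower|standalone".
    for host in zoo1 zoo2 zoo3; do    # placeholder host names
        mode=$(echo srvr | nc "$host" 2181 | awk '/^Mode:/ {print $2}')
        if [ "$mode" = "standalone" ]; then
            echo "Refusing to start: $host is in standalone mode" >&2
            exit 1
        fi
    done
    exec bin/solr start -c -z zoo1:2181,zoo2:2181,zoo3:2181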
On Mon, Mar 13, 2017 at 5:52 PM, Jan Høydahl <[email protected]> wrote:

> Hi
>
> Thanks for reporting. As it may take some time before we get ZK 3.5.x
> out there, a fix would be nice. Do you plan to make our zkClient somehow
> explicitly validate that all given zk nodes are “good”?
>
> Or is there some way we could fix this with documentation? I imagine, if
> we always propose to use a chroot, e.g. ZK_HOST=zoo1,zoo2,zoo3/solr,
> then it would be a requirement to do a mkroot before being able to use
> ZK. And I assume that in that case, if one of the ZK nodes got restarted
> with a missing or wrong configuration, it would start up with some other
> data folder(?) and refuse to serve any data whatsoever, since the /solr
> root would not exist?
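For reference, creating such a chroot is a one-time step before Solr can
use it. A sketch with placeholder hosts, using Zookeeper's own CLI (newer
Solr versions also ship a "bin/solr zk mkroot" command, if I recall
correctly):

    # Create the /solr chroot once, before pointing Solr at it:
    zkCli.sh -server zoo1:2181 create /solr ""

    # Then in solr.in.sh:
    ZK_HOST=zoo1:2181,zoo2:2181,zoo3:2181/solr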
> I’d say, even if this is not a Solr bug per se, it is still worthy of a
> JIRA issue.
>
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
>
> On 14 Mar 2017, at 00:11, Ben DeMott <[email protected]> wrote:
>
> So I wanted to throw this out there and get any feedback.
>
> We had a persistent issue with our Solr clusters doing crazy things,
> from running out of file descriptors, to replication issues, to filling
> up /overseer/queue. Just some of the log exceptions:
>
> o.e.j.s.ServerConnector java.io.IOException: Too many open files
>
> o.a.s.s.HttpSolrCall null:org.apache.solr.common.SolrException: Error
> trying to proxy request for url:
> http://10.50.64.4:8983/solr/efc-jobsearch-col/select
>
> o.a.s.h.RequestHandlerBase org.apache.solr.common.SolrException:
> ClusterState says we are the leader
> (http://10.50.64.4:8983/solr/efc-jobsearch-col_shard1_replica2), but
> locally we don't think so. Request came from null
>
> o.a.s.c.Overseer Bad version writing to ZK using compare-and-set, will
> force refresh cluster state: KeeperErrorCode = BadVersion for
> /collections/efc-jobsearch-col/state.json
>
> IndexFetcher File _5oz.nvd did not match. expected checksum is
> 3661731988 and actual is 840593658. expected length is 271091 and actual
> length is 271091
>
> ...
>
> I'll get to the point quickly. This was all caused by the Zookeeper
> configuration on a particular node getting reset for a period of
> seconds, and the service being restarted automatically. When this
> happened, Solr's connection to Zookeeper would be reset, and Solr would
> reconnect to the Zookeeper node, which now had a blank configuration and
> was running in "STANDALONE" mode. The changes to ZK registered by that
> Solr connection would never be registered with the rest of the cluster.
>
> As a result, the cversion of /live_nodes would be ahead of the other
> servers by a version or two, but the zxids would all be in sync. The
> nodes would never re-synchronize; as far as Zookeeper is concerned,
> everything is synced up properly. Also, /live_nodes would be a
> mismatched mess: empty or inconsistent depending on where Solr's ZK
> connections were pointed, resulting in client connections returning
> some, wrong, or no "live nodes".
>
> Now, the Zookeeper documentation specifically tells you never to connect
> to an inconsistent group of servers, as it will play havoc with
> Zookeeper, and it did exactly this. It caused absolute havoc within our
> cluster.
>
> As of Zookeeper 3.5 there is an option to NEVER allow it to run in
> standalone, which we will be using when a stable version is released.
>
> So to summarize: if a Zookeeper ensemble host ever goes into standalone,
> even temporarily, Solr will be disconnected, may then reconnect
> (depending on which ZK node it picks), and its updates will never be
> synchronized. It also won't be able to coordinate any of its Cloud
> operations.
>
> So, in the interest of being a good internet citizen, I'm writing this
> up: is there any desire for a patch that would provide a configuration
> or JVM option to refuse to connect to nodes in standalone operation?
> Obviously the built-in ZK server that comes with Solr runs in standalone
> mode, so this would only be an option in solr.in.sh. But it would
> prevent Solr from bringing the entire cluster down in the event a single
> ZK server was temporarily misconfigured, or lost its configuration for
> some reason.
>
> Maybe this isn't worth addressing. Thoughts?
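P.S. The Zookeeper 3.5 option referred to above is, as far as I can tell
from the 3.5 docs, the standaloneEnabled property. Roughly what we plan
to run once a stable 3.5 is out (a sketch; placeholder hosts):

    # zoo.cfg on every ensemble member (Zookeeper 3.5+ only).
    # With standaloneEnabled=false the server always runs the quorum
    # protocol and will not fall back to standalone mode, even if it
    # comes up seeing a single-server configuration.
    standaloneEnabled=false
    server.1=zoo1:2888:3888
    server.2=zoo2:2888:3888
    server.3=zoo3:2888:3888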
