Yes, publishing state for other nodes seems like a problem waiting to happen.
Let Zookeeper be the single source of truth and live with the extra traffic. How about a first change to ignore any second-hand state, then later we can stop sending it and collapse the queues. Maybe receiving queues could be collapsed as soon as they ignore the second-hand state.
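
Roughly what I'm picturing for the receiving side, as a sketch only. The StateUpdate fields, class names, and the sequence number are made up for illustration, not the real Overseer message format. Drain whatever is queued, drop anything one node published on another node's behalf, and keep only the newest update per core:

    import java.util.LinkedHashMap;
    import java.util.Map;
    import java.util.Queue;

    // Hypothetical shape of a state message; field names are illustrative only.
    class StateUpdate {
        final String coreName;      // core this update describes
        final String publisherNode; // node that published the update
        final String ownerNode;     // node the core actually lives on
        final String state;         // e.g. "down", "recovering", "active"
        final long sequence;        // increases with each update from a publisher

        StateUpdate(String coreName, String publisherNode, String ownerNode,
                    String state, long sequence) {
            this.coreName = coreName;
            this.publisherNode = publisherNode;
            this.ownerNode = ownerNode;
            this.state = state;
            this.sequence = sequence;
        }
    }

    class StateQueueCoalescer {
        // Drain everything queued, ignore second-hand updates (published by a
        // node other than the one that owns the core), and keep only the newest
        // update per core. Everything else is skipped.
        static Map<String, StateUpdate> coalesce(Queue<StateUpdate> queue) {
            Map<String, StateUpdate> latestPerCore = new LinkedHashMap<>();
            StateUpdate u;
            while ((u = queue.poll()) != null) {
                if (!u.publisherNode.equals(u.ownerNode)) {
                    continue; // second-hand state: let ZK be the source of truth
                }
                StateUpdate prev = latestPerCore.get(u.coreName);
                if (prev == null || u.sequence > prev.sequence) {
                    latestPerCore.put(u.coreName, u);
                }
            }
            return latestPerCore;
        }
    }

The sending side could do the same collapse before anything is ever written to ZK, which is where most of the win is. If updates really do depend on earlier updates, as Mark says below, this only works once that progression requirement goes away.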
wunder
Walter Underwood
[email protected]
http://observer.wunderwood.org/ (my blog)

> On Nov 23, 2016, at 3:08 PM, Mark Miller <[email protected]> wrote:
>
> Although, if we fixed the fact that the leader sometimes publishes state for replicas (which I think is a mistake, I worked hard initially to avoid a node ever publishing state for another node) you could at least track the last state published and avoid repeating it over and over pretty easily.
>
> On Wed, Nov 23, 2016 at 6:03 PM Mark Miller <[email protected]> wrote:
> I didn't say stale state though actually. I said state progressions.
>
> On Wed, Nov 23, 2016 at 6:03 PM Mark Miller <[email protected]> wrote:
> Because that's how it works.
>
> On Wed, Nov 23, 2016 at 5:57 PM Walter Underwood <[email protected]> wrote:
> Why would other nodes need to see stale state?
>
> If they really need intermediate state changes, that sounds like a problem.
>
> wunder
> Walter Underwood
> [email protected]
> http://observer.wunderwood.org/ (my blog)
>
>> On Nov 23, 2016, at 2:53 PM, Mark Miller <[email protected]> wrote:
>>
>> In many cases other nodes need to see a progression of state changes. You really have to clear the deck and try to start from 0.
>>
>> On Wed, Nov 23, 2016 at 5:50 PM Walter Underwood <[email protected]> wrote:
>> If the queue is local and the state messages are complete, the local queue should only send the latest, most accurate update. The rest can be skipped.
>>
>> The same could be done on the receiving end. Suck the queue dry, then choose the most recent.
>>
>> If the updates depend on previous updates, it would be a lot more work to compile the latest delta.
>>
>> wunder
>> Walter Underwood
>> [email protected]
>> http://observer.wunderwood.org/ (my blog)
>>
>>> On Nov 23, 2016, at 2:45 PM, Mark Miller <[email protected]> wrote:
>>>
>>> I talked about this type of thing with Jessica at Lucene/Solr Revolution. One thing is, when you reconnect after losing the ZK connection, it should now efficiently set every core as down in a single command, not each core. Beyond that, any single node knows how fast it's sending overseer updates. Each should have a governor. If the rate is too high, a node should know it's best to just give up and assume things are screwed. It could try and reset from ground zero.
>>>
>>> There are other things that can be done, but given the current design, the simplest win is that a replica can easily prevent itself from spamming the overseer queue.
>>>
>>> Mark
>>>
>>> On Wed, Nov 23, 2016 at 5:05 PM Scott Blum <[email protected]> wrote:
>>> I've been fighting fires the last day where certain of our Solr nodes will have long GC pauses that cause them to lose their ZK connection and have to reconnect. That would be annoying, but survivable, although obviously it's something I want to fix.
>>>
>>> But what makes it fatal is the current design of the state update queue.
>>>
>>> Every time one of our nodes flaps, it ends up shoving thousands of state updates and leader requests onto the queue, most of them ultimately futile. By the time the state is actually published, it's already stale. At one point we had 400,000 items in the queue and I just had to declare bankruptcy, delete the entire queue, and elect a new overseer. Later, we had 70,000 items from several flaps that took an hour to churn through, even after I'd shut down the problematic nodes. Again, almost entirely useless, repetitive work.
>>>
>>> Digging through ZKController and related code, the current model just seems terribly outdated and non-scalable now. If a node flaps for just a moment, do we really need to laboriously update every core's state to down, just to mark it up again? What purpose does this serve that isn't already served by the global live_nodes presence indication and/or leader election nodes?
>>>
>>> Rebooting a node creates a similar set of problems: a couple hundred cores end up generating thousands of ZK operations just to get back to normal state.
>>>
>>> We're at enough of a breaking point that I have to do something here for our own cluster. I would love to put my head together with some of the more knowledgeable Solr operations folks to help redesign something that could land in master and improve scalability for everyone. I'd also love to hear about any prior art or experiments folks have done. And if there are already efforts in progress to address this very issue, apologies for being out of the loop.
>>>
>>> Thanks!
>>> Scott
>>>
>>> --
>>> - Mark
>>> about.me/markrmiller
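
One more sketch, for the governor Mark describes above: a per-node rate check on overseer state publishing. The class name, window, and limit here are invented, and "reset from ground zero" would be whatever full re-registration path already exists.

    import java.util.ArrayDeque;
    import java.util.Deque;

    // Hypothetical per-node governor: if we find ourselves publishing state
    // updates faster than some sane ceiling, stop spamming the queue and fall
    // back to a full reset instead. Window and limit are illustrative.
    class OverseerUpdateGovernor {
        private static final long WINDOW_MS = 10_000; // look at the last 10 seconds
        private static final int MAX_UPDATES = 500;   // made-up ceiling

        private final Deque<Long> recentUpdates = new ArrayDeque<>();

        // Returns true if it is still sane to publish another state update.
        synchronized boolean tryAcquire(long nowMs) {
            // Forget timestamps that have fallen out of the window.
            while (!recentUpdates.isEmpty()
                    && nowMs - recentUpdates.peekFirst() > WINDOW_MS) {
                recentUpdates.removeFirst();
            }
            if (recentUpdates.size() >= MAX_UPDATES) {
                return false; // give up: caller should reset from ground zero
            }
            recentUpdates.addLast(nowMs);
            return true;
        }
    }

A node that trips the limit would stop publishing per-core updates, mark itself down once (the single whole-node down command Mark mentions), and re-register cleanly instead of pushing thousands of soon-to-be-stale items.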
