Yes, publishing state for other nodes seems like a problem waiting to happen.
Let Zookeeper be the single source of truth and live with the extra traffic. How about a first change to ignore any second-hand state, then later we can stop sending it and collapse the queues. Maybe receiving queues could be collapsed as soon as they ignore the second-hand state.
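
Roughly what I'm picturing for the receiving side, as a sketch only. The StateUpdate fields, class names, and the sequence number are made up for illustration, not the real Overseer message format. Drain whatever is queued, drop anything one node published on another node's behalf, and keep only the newest update per core:

    import java.util.LinkedHashMap;
    import java.util.Map;
    import java.util.Queue;

    // Hypothetical shape of a state message; field names are illustrative only.
    class StateUpdate {
        final String coreName;      // core this update describes
        final String publisherNode; // node that published the update
        final String ownerNode;     // node the core actually lives on
        final String state;         // e.g. "down", "recovering", "active"
        final long sequence;        // increases with each update from a publisher

        StateUpdate(String coreName, String publisherNode, String ownerNode,
                    String state, long sequence) {
            this.coreName = coreName;
            this.publisherNode = publisherNode;
            this.ownerNode = ownerNode;
            this.state = state;
            this.sequence = sequence;
        }
    }

    class StateQueueCoalescer {
        // Drain everything queued, ignore second-hand updates (published by a
        // node other than the one that owns the core), and keep only the newest
        // update per core. Everything else is skipped.
        static Map<String, StateUpdate> coalesce(Queue<StateUpdate> queue) {
            Map<String, StateUpdate> latestPerCore = new LinkedHashMap<>();
            StateUpdate u;
            while ((u = queue.poll()) != null) {
                if (!u.publisherNode.equals(u.ownerNode)) {
                    continue; // second-hand state: let ZK be the source of truth
                }
                StateUpdate prev = latestPerCore.get(u.coreName);
                if (prev == null || u.sequence > prev.sequence) {
                    latestPerCore.put(u.coreName, u);
                }
            }
            return latestPerCore;
        }
    }

The sending side could do the same collapse before anything is ever written to ZK, which is where most of the win is. If updates really do depend on earlier updates, as Mark says below, this only works once that progression requirement goes away.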
wunder
Walter Underwood
[email protected]
http://observer.wunderwood.org/ (my blog)

> On Nov 23, 2016, at 3:08 PM, Mark Miller <[email protected]> wrote:
>
> Although, if we fixed the fact that the leader sometimes publishes state for replicas (which I think is a mistake, I worked hard initially to avoid a node ever publishing state for another node) you could at least track the last state published and avoid repeating it over and over pretty easily.
>
> On Wed, Nov 23, 2016 at 6:03 PM Mark Miller <[email protected]> wrote:
> I didn't say stale state though actually. I said state progressions.
>
> On Wed, Nov 23, 2016 at 6:03 PM Mark Miller <[email protected]> wrote:
> Because that's how it works.
>
> On Wed, Nov 23, 2016 at 5:57 PM Walter Underwood <[email protected]> wrote:
> Why would other nodes need to see stale state?
>
> If they really need intermediate state changes, that sounds like a problem.
>
> wunder
> Walter Underwood
> [email protected]
> http://observer.wunderwood.org/ (my blog)
>
>> On Nov 23, 2016, at 2:53 PM, Mark Miller <[email protected]> wrote:
>>
>> In many cases other nodes need to see a progression of state changes. You really have to clear the deck and try to start from 0.
>>
>> On Wed, Nov 23, 2016 at 5:50 PM Walter Underwood <[email protected]> wrote:
>> If the queue is local and the state messages are complete, the local queue should only send the latest, most accurate update. The rest can be skipped.
>>
>> The same could be done on the receiving end. Suck the queue dry, then choose the most recent.
>>
>> If the updates depend on previous updates, it would be a lot more work to compile the latest delta.
>>
>> wunder
>> Walter Underwood
>> [email protected]
>> http://observer.wunderwood.org/ (my blog)
>>
>>> On Nov 23, 2016, at 2:45 PM, Mark Miller <[email protected]> wrote:
>>>
>>> I talked about this type of thing with Jessica at Lucene/Solr Revolution. One thing is, when you reconnect after losing the ZK connection, it should now efficiently set every core as down in a single command, not each core. Beyond that, any single node knows how fast it's sending overseer updates. Each should have a governor. If the rate is too high, a node should know it's best to just give up and assume things are screwed. It could try and reset from ground zero.
>>>
>>> There are other things that can be done, but given the current design, the simplest win is that a replica can easily prevent itself from spamming the overseer queue.
>>>
>>> Mark
>>>
>>> On Wed, Nov 23, 2016 at 5:05 PM Scott Blum <[email protected]> wrote:
>>> I've been fighting fires the last day where certain of our Solr nodes will have long GC pauses that cause them to lose their ZK connection and have to reconnect. That would be annoying, but survivable, although obviously it's something I want to fix.
>>>
>>> But what makes it fatal is the current design of the state update queue.
>>>
>>> Every time one of our nodes flaps, it ends up shoving thousands of state updates and leader requests onto the queue, most of them ultimately futile. By the time the state is actually published, it's already stale. At one point we had 400,000 items in the queue and I just had to declare bankruptcy, delete the entire queue, and elect a new overseer. Later, we had 70,000 items from several flaps that took an hour to churn through, even after I'd shut down the problematic nodes. Again, almost entirely useless, repetitive work.
>>>
>>> Digging through ZKController and related code, the current model just seems terribly outdated and non-scalable now. If a node flaps for just a moment, do we really need to laboriously update every core's state to down, just to mark it up again? What purpose does this serve that isn't already served by the global live_nodes presence indication and/or leader election nodes?
>>>
>>> Rebooting a node creates a similar set of problems: a couple hundred cores end up generating thousands of ZK operations just to get back to normal state.
>>>
>>> We're at enough of a breaking point that I have to do something here for our own cluster. I would love to put my head together with some of the more knowledgeable Solr operations folks to help redesign something that could land in master and improve scalability for everyone. I'd also love to hear about any prior art or experiments folks have done. And if there are already efforts in progress to address this very issue, apologies for being out of the loop.
>>>
>>> Thanks!
>>> Scott
>>>
>>> --
>>> - Mark
>>> about.me/markrmiller
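
One more sketch, for the governor Mark describes above: a per-node rate check on overseer state publishing. The class name, window, and limit here are invented, and "reset from ground zero" would be whatever full re-registration path already exists.

    import java.util.ArrayDeque;
    import java.util.Deque;

    // Hypothetical per-node governor: if we find ourselves publishing state
    // updates faster than some sane ceiling, stop spamming the queue and fall
    // back to a full reset instead. Window and limit are illustrative.
    class OverseerUpdateGovernor {
        private static final long WINDOW_MS = 10_000; // look at the last 10 seconds
        private static final int MAX_UPDATES = 500;   // made-up ceiling

        private final Deque<Long> recentUpdates = new ArrayDeque<>();

        // Returns true if it is still sane to publish another state update.
        synchronized boolean tryAcquire(long nowMs) {
            // Forget timestamps that have fallen out of the window.
            while (!recentUpdates.isEmpty()
                    && nowMs - recentUpdates.peekFirst() > WINDOW_MS) {
                recentUpdates.removeFirst();
            }
            if (recentUpdates.size() >= MAX_UPDATES) {
                return false; // give up: caller should reset from ground zero
            }
            recentUpdates.addLast(nowMs);
            return true;
        }
    }

A node that trips the limit would stop publishing per-core updates, mark itself down once (the single whole-node down command Mark mentions), and re-register cleanly instead of pushing thousands of soon-to-be-stale items.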
