Why would other nodes need to see stale state? If they really need intermediate state changes, that sounds like a problem.
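To restate my local-queue idea in code: if every state message is complete rather than a delta, the sender only needs to keep the newest message per core. A rough, untested sketch — publish() is a stand-in for the real overseer-queue write, not an actual Solr API:

import java.util.LinkedHashMap;
import java.util.Map;

// Sender-side coalescing sketch: one pending message per core, so a
// burst of updates collapses to just the latest before anything hits ZK.
// publish() is a placeholder, not a real Solr API.
class CoalescingStatePublisher {

  private final Map<String, byte[]> pending = new LinkedHashMap<String, byte[]>();

  public synchronized void enqueue(String coreName, byte[] stateMessage) {
    pending.put(coreName, stateMessage); // any older message for this core is dropped
  }

  // Called on a timer or after a quiet period: send only the survivors.
  public synchronized void flush() {
    for (Map.Entry<String, byte[]> e : pending.entrySet()) {
      publish(e.getKey(), e.getValue()); // e.g. offer to the overseer queue
    }
    pending.clear();
  }

  private void publish(String coreName, byte[] stateMessage) {
    // stand-in for the actual queue write
  }
}

The receiving end could do the same thing before processing: drain everything available, keep the newest entry per core, and handle only those.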
wunder
Walter Underwood
[email protected]
http://observer.wunderwood.org/ (my blog)

> On Nov 23, 2016, at 2:53 PM, Mark Miller <[email protected]> wrote:
>
> In many cases other nodes need to see a progression of state changes. You
> really have to clear the deck and try to start from 0.
>
> On Wed, Nov 23, 2016 at 5:50 PM Walter Underwood <[email protected]> wrote:
> If the queue is local and the state messages are complete, the local queue
> should only send the latest, most accurate update. The rest can be skipped.
>
> The same could be done on the receiving end. Suck the queue dry, then choose
> the most recent.
>
> If the updates depend on previous updates, it would be a lot more work to
> compile the latest delta.
>
> wunder
> Walter Underwood
> [email protected]
> http://observer.wunderwood.org/ (my blog)
>
>
>> On Nov 23, 2016, at 2:45 PM, Mark Miller <[email protected]> wrote:
>>
>> I talked about this type of thing with Jessica at Lucene/Solr Revolution.
>> One thing is, when you reconnect after losing your ZK connection, it should
>> now efficiently set every core as down in a single command, not one command
>> per core. Beyond that, any single node knows how fast it's sending overseer
>> updates. Each should have a governor. If the rate is too high, a node should
>> know it's best to just give up and assume things are screwed. It could try
>> and reset from ground zero.
>>
>> There are other things that can be done, but given the current design, the
>> simplest win is that a replica can easily prevent itself from spamming the
>> overseer queue.
>>
>> Mark
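[Interjecting on the governor idea above: that seems cheap to bolt on. A rough, untested sketch — the threshold, the window, and resetFromGroundZero() are all invented for illustration, not existing Solr code.]

// Governor sketch: each node tracks its own overseer-update send rate
// and, past a threshold, stops spamming and resets itself instead.
class OverseerUpdateGovernor {

  private static final int MAX_UPDATES_PER_WINDOW = 1000; // invented threshold
  private static final long WINDOW_MS = 60 * 1000L;       // invented window

  private long windowStart = System.currentTimeMillis();
  private int sentInWindow = 0;

  // Returns true if the update may be sent; false means "give up,
  // assume things are screwed, and start over from a clean slate".
  public synchronized boolean tryAcquire() {
    long now = System.currentTimeMillis();
    if (now - windowStart > WINDOW_MS) {
      windowStart = now;
      sentInWindow = 0;
    }
    if (++sentInWindow > MAX_UPDATES_PER_WINDOW) {
      resetFromGroundZero();
      return false;
    }
    return true;
  }

  private void resetFromGroundZero() {
    // e.g. publish one node-wide "down", clear local state, re-register
  }
}

Every publish attempt would go through tryAcquire() first; a false return is the "assume things are screwed" signal.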
>> On Wed, Nov 23, 2016 at 5:05 PM Scott Blum <[email protected]> wrote:
>> I've been fighting fires the last day where some of our Solr nodes have
>> long GC pauses that cause them to lose their ZK connection and have to
>> reconnect. That would be annoying but survivable, although obviously it's
>> something I want to fix.
>>
>> But what makes it fatal is the current design of the state update queue.
>>
>> Every time one of our nodes flaps, it ends up shoving thousands of state
>> updates and leader requests onto the queue, most of them ultimately futile.
>> By the time the state is actually published, it's already stale. At one
>> point we had 400,000 items in the queue and I just had to declare
>> bankruptcy, delete the entire queue, and elect a new overseer. Later, we
>> had 70,000 items from several flaps that took an hour to churn through,
>> even after I'd shut down the problematic nodes. Again, almost entirely
>> useless, repetitive work.
>>
>> Digging through ZKController and related code, the current model just seems
>> terribly outdated and non-scalable now. If a node flaps for just a moment,
>> do we really need to laboriously mark every core's state down, just to mark
>> it up again? What purpose does this serve that isn't already served by the
>> global live_nodes presence indication and/or leader election nodes?
>>
>> Rebooting a node creates a similar set of problems: a couple hundred cores
>> end up generating thousands of ZK operations just to get back to normal
>> state.
>>
>> We're at enough of a breaking point that I have to do something here for
>> our own cluster. I would love to put my head together with some of the more
>> knowledgeable Solr operations folks to help redesign something that could
>> land in master and improve scalability for everyone. I'd also love to hear
>> about any prior art or experiments folks have done. And if there are
>> already efforts in progress to address this very issue, apologies for being
>> out of the loop.
>>
>> Thanks!
>> Scott
>>
>> --
>> - Mark
>> about.me/markrmiller
>
> --
> - Mark
> about.me/markrmiller
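P.S. On Scott's live_nodes question above: any client can already observe node liveness with a single children watch, with no per-core churn. A rough sketch against the stock ZooKeeper client (reconnect and error handling omitted; it also deliberately ignores per-replica detail like recovery state, which is Mark's point about nodes needing to see a progression of changes):

import java.util.HashSet;
import java.util.List;
import java.util.Set;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

// Keeps a local view of which nodes are alive by watching /live_nodes.
class LiveNodesWatcher implements Watcher {

  private final ZooKeeper zk;
  private volatile Set<String> liveNodes = new HashSet<String>();

  LiveNodesWatcher(ZooKeeper zk) throws KeeperException, InterruptedException {
    this.zk = zk;
    refresh();
  }

  public void process(WatchedEvent event) {
    try {
      refresh(); // re-read the children and re-arm the watch on any change
    } catch (Exception e) {
      // a real implementation would retry / resubscribe here
    }
  }

  private void refresh() throws KeeperException, InterruptedException {
    List<String> children = zk.getChildren("/live_nodes", this);
    liveNodes = new HashSet<String>(children);
  }

  // One watch replaces thousands of per-core "down then up" queue entries,
  // at least for plain reachability.
  public boolean isLive(String nodeName) {
    return liveNodes.contains(nodeName);
  }
}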
