If the queue is local and the state messages are complete, the local queue should only send the latest, most accurate update. The rest can be skipped.
The same could be done on the receiving end. Suck the queue dry, then choose the most recent. If the updates depend on previous updates, it would be a lot more work to compile the latest delta.

(Rough sketches of the coalescing drain and of a per-node governor are appended after the quoted thread below.)

wunder
Walter Underwood
[email protected]
http://observer.wunderwood.org/  (my blog)

> On Nov 23, 2016, at 2:45 PM, Mark Miller <[email protected]> wrote:
>
> I talked about this type of thing with Jessica at Lucene/Solr Revolution.
> One thing is, when you reconnect after losing the ZK connection, it should
> now efficiently set every core as down in a single command, not each core.
> Beyond that, any single node knows how fast it's sending overseer updates.
> Each should have a governor. If the rate is too high, a node should know
> it's best to just give up and assume things are screwed. It could try and
> reset from ground zero.
>
> There are other things that can be done, but given the current design, the
> simplest win is that a replica can easily prevent itself from spamming the
> overseer queue.
>
> Mark
>
> On Wed, Nov 23, 2016 at 5:05 PM Scott Blum <[email protected]> wrote:
>
> I've been fighting fires the last day where certain of our Solr nodes will
> have long GC pauses that cause them to lose their ZK connection and have to
> reconnect. That would be annoying, but survivable, although obviously it's
> something I want to fix.
>
> But what makes it fatal is the current design of the state update queue.
>
> Every time one of our nodes flaps, it ends up shoving thousands of state
> updates and leader requests onto the queue, most of them ultimately futile.
> By the time the state is actually published, it's already stale. At one
> point we had 400,000 items in the queue and I just had to declare
> bankruptcy, delete the entire queue, and elect a new overseer. Later, we
> had 70,000 items from several flaps that took an hour to churn through,
> even after I'd shut down the problematic nodes. Again, almost entirely
> useless, repetitive work.
>
> Digging through ZKController and related code, the current model just seems
> terribly outdated and non-scalable now. If a node flaps for just a moment,
> do we really need to laboriously mark every core's state down, just to mark
> it up again? What purpose does this serve that isn't already served by the
> global live_nodes presence indication and/or leader election nodes?
>
> Rebooting a node creates a similar set of problems: a couple hundred cores
> end up generating thousands of ZK operations just to get back to normal
> state.
>
> We're at enough of a breaking point that I have to do something here for
> our own cluster. I would love to put my head together with some of the more
> knowledgeable Solr operations folks to help redesign something that could
> land in master and improve scalability for everyone. I'd also love to hear
> about any prior art or experiments folks have done. And if there are
> already efforts in progress to address this very issue, apologies for being
> out of the loop.
>
> Thanks!
> Scott
>
> --
> - Mark
> about.me/markrmiller
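
A minimal sketch of the receive-side coalescing Walter describes: drain everything pending, keep only the newest complete state message per core, and process just those. The CoalescingDrain class and StateUpdate record are placeholders for illustration, not Solr's actual Overseer message types.

    import java.util.LinkedHashMap;
    import java.util.Map;
    import java.util.Queue;

    /**
     * Sketch of the receive-side idea from the thread: drain everything that
     * is pending, then keep only the newest state update per core. This only
     * works because the messages are assumed to be complete (absolute state,
     * not deltas), so older ones can be dropped safely.
     */
    final class CoalescingDrain {

        /** Placeholder for a full (non-delta) per-core state message. */
        record StateUpdate(String coreKey, String state, long timestampMs) {}

        /**
         * Drain the pending queue and return only the latest update per core.
         */
        static Map<String, StateUpdate> drainLatest(Queue<StateUpdate> pending) {
            Map<String, StateUpdate> latest = new LinkedHashMap<>();
            StateUpdate msg;
            while ((msg = pending.poll()) != null) {
                // A later message for the same core simply replaces the earlier one.
                latest.put(msg.coreKey(), msg);
            }
            return latest;
        }
    }

A consumer would call drainLatest(queue) once per pass and publish only the surviving entries. If the messages were deltas rather than complete states, the collapse would instead have to fold successive deltas together, which is the extra work Walter mentions.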

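And a sketch of the per-node governor Mark describes: each node rate-limits its own overseer updates, and a rejected acquire is the signal to stop queueing per-core messages and fall back to one coarse reset. The class name, window-based threshold, and escalation idea are assumptions for illustration, not existing Solr code.

    /**
     * Sketch of a per-node governor: track how fast this node is pushing
     * overseer state updates and, past a threshold, refuse further per-core
     * messages so the caller can escalate to a single bulk action instead.
     */
    final class OverseerUpdateGovernor {

        private final int maxUpdatesPerWindow;
        private final long windowMillis;

        private long windowStart = System.currentTimeMillis();
        private int count = 0;

        OverseerUpdateGovernor(int maxUpdatesPerWindow, long windowMillis) {
            this.maxUpdatesPerWindow = maxUpdatesPerWindow;
            this.windowMillis = windowMillis;
        }

        /**
         * Returns true if this node may enqueue another per-core state update.
         * A false return means the rate is too high and the caller should skip
         * the per-core message and fall back to one coarse-grained action,
         * e.g. publishing all cores down in a single command and recovering.
         */
        synchronized boolean tryAcquire() {
            long now = System.currentTimeMillis();
            if (now - windowStart >= windowMillis) {
                windowStart = now;   // start a new rate window
                count = 0;
            }
            return ++count <= maxUpdatesPerWindow;
        }
    }

A node reconnecting after a GC pause would call tryAcquire() before each per-core publish; the first false return is the cue to stop, send one bulk "all cores down" style message, and recover from there rather than shoving thousands of individual updates onto the queue.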