I didn't say stale state though, actually. I said state progressions.

On Wed, Nov 23, 2016 at 6:03 PM Mark Miller <[email protected]> wrote:
> Because that's how it works.
>
> On Wed, Nov 23, 2016 at 5:57 PM Walter Underwood <[email protected]> wrote:
>
> > Why would other nodes need to see stale state?
> >
> > If they really need intermediate state changes, that sounds like a problem.
> >
> > wunder
> > Walter Underwood
> > [email protected]
> > http://observer.wunderwood.org/ (my blog)
> >
> > On Nov 23, 2016, at 2:53 PM, Mark Miller <[email protected]> wrote:
> >
> > > In many cases other nodes need to see a progression of state changes. You really have to clear the deck and try to start from 0.
> > >
> > > On Wed, Nov 23, 2016 at 5:50 PM Walter Underwood <[email protected]> wrote:
> > >
> > > > If the queue is local and the state messages are complete, the local queue should only send the latest, most accurate update. The rest can be skipped.
> > > >
> > > > The same could be done on the receiving end. Suck the queue dry, then choose the most recent.
> > > >
> > > > If the updates depend on previous updates, it would be a lot more work to compile the latest delta.
> > > >
> > > > wunder
> > > > Walter Underwood
> > > > [email protected]
> > > > http://observer.wunderwood.org/ (my blog)
> > > >
> > > > On Nov 23, 2016, at 2:45 PM, Mark Miller <[email protected]> wrote:
> > > >
> > > > > I talked about this type of thing with Jessica at Lucene/Solr Revolution. One thing is, when you reconnect after losing the connection to ZK, it should now efficiently set every core as down in a single command, not one command per core. Beyond that, any single node knows how fast it's sending overseer updates. Each should have a governor. If the rate is too high, a node should know it's best to just give up and assume things are screwed. It could try and reset from ground zero.
> > > > >
> > > > > There are other things that can be done, but given the current design, the simplest win is that a replica can easily prevent itself from spamming the overseer queue.
> > > > >
> > > > > Mark
> > > > >
> > > > > On Wed, Nov 23, 2016 at 5:05 PM Scott Blum <[email protected]> wrote:
> > > > >
> > > > > > I've been fighting fires the last day where certain of our Solr nodes will have long GC pauses that cause them to lose their ZK connection and have to reconnect. That would be annoying, but survivable, although obviously it's something I want to fix.
> > > > > >
> > > > > > But what makes it fatal is the current design of the state update queue.
> > > > > >
> > > > > > Every time one of our nodes flaps, it ends up shoving thousands of state updates and leader requests onto the queue, most of them ultimately futile. By the time the state is actually published, it's already stale. At one point we had 400,000 items in the queue and I just had to declare bankruptcy, delete the entire queue, and elect a new overseer. Later, we had 70,000 items from several flaps that took an hour to churn through, even after I'd shut down the problematic nodes. Again, almost entirely useless, repetitive work.
> > > > > >
> > > > > > Digging through ZKController and related code, the current model just seems terribly outdated and non-scalable now. If a node flaps for just a moment, do we really need to laboriously mark every core's state down, just to mark it up again? What purpose does this serve that isn't already served by the global live_nodes presence indication and/or leader election nodes?
> > > > > >
> > > > > > Rebooting a node creates a similar set of problems: a couple hundred cores end up generating thousands of ZK operations just to get back to normal state.
> > > > > >
> > > > > > We're at enough of a breaking point that I *have* to do something here for our own cluster.
> > > > > > I would love to put my head together with some of the more knowledgeable Solr operations folks to help redesign something that could land in master and improve scalability for everyone. I'd also love to hear about any prior art or experiments folks have done. And if there are already efforts in progress to address this very issue, apologies for being out of the loop.
> > > > > >
> > > > > > Thanks!
> > > > > > Scott

--
- Mark
about.me/markrmiller
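
(Purely as illustration of Walter's receive-side suggestion above: drain whatever is currently queued and keep only the newest message per core, so older, already-stale updates for the same core never get processed. The StateUpdate record and drainLatest method below are hypothetical names for the sketch, not anything in Solr's actual Overseer code.)

// Minimal sketch, assuming each queued message carries a core name, a state,
// and a monotonically increasing timestamp.
import java.util.ArrayDeque;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Queue;

public class CoalescingDrain {

    // Hypothetical stand-in for a queued state update message.
    record StateUpdate(String coreName, String state, long timestamp) {}

    /** Drain everything currently in the queue, keeping only the latest update per core. */
    static Map<String, StateUpdate> drainLatest(Queue<StateUpdate> queue) {
        Map<String, StateUpdate> latest = new LinkedHashMap<>();
        StateUpdate msg;
        while ((msg = queue.poll()) != null) {
            // A newer message for the same core replaces the older one.
            latest.merge(msg.coreName(), msg,
                    (old, neu) -> neu.timestamp() >= old.timestamp() ? neu : old);
        }
        return latest;
    }

    public static void main(String[] args) {
        Queue<StateUpdate> queue = new ArrayDeque<>();
        queue.add(new StateUpdate("core1", "down", 1));
        queue.add(new StateUpdate("core1", "recovering", 2));
        queue.add(new StateUpdate("core1", "active", 3));
        queue.add(new StateUpdate("core2", "down", 4));
        // Only two updates survive: core1 -> active, core2 -> down.
        drainLatest(queue).forEach((core, m) ->
                System.out.println(core + " -> " + m.state()));
    }
}

Mark's objection in the thread is exactly the caveat here: if consumers need to see the progression (e.g. down -> recovering -> active), collapsing to the latest state is only safe where intermediate states genuinely don't matter.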
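(And a rough sketch of the per-node "governor" Mark mentions: track how many state updates the node has published within a sliding window, and once the rate crosses a threshold, stop publishing individual updates and fall back to a full reset. StateUpdateGovernor and resetFromGroundZero are made-up names for illustration, not existing Solr APIs.)

import java.util.ArrayDeque;
import java.util.Deque;

public class StateUpdateGovernor {
    private final int maxUpdates;        // updates allowed per window
    private final long windowMillis;     // sliding window size
    private final Deque<Long> recent = new ArrayDeque<>();

    public StateUpdateGovernor(int maxUpdates, long windowMillis) {
        this.maxUpdates = maxUpdates;
        this.windowMillis = windowMillis;
    }

    /** Returns true if this update may be published; false means "give up and reset". */
    public synchronized boolean tryPublish(long nowMillis) {
        // Drop timestamps that have fallen out of the window.
        while (!recent.isEmpty() && nowMillis - recent.peekFirst() > windowMillis) {
            recent.pollFirst();
        }
        if (recent.size() >= maxUpdates) {
            return false; // rate too high: assume things are screwed, reset instead
        }
        recent.addLast(nowMillis);
        return true;
    }

    // Caller-side usage sketch:
    // if (!governor.tryPublish(System.currentTimeMillis())) {
    //     resetFromGroundZero();  // hypothetical: one bulk "all cores down" + re-register
    // }
}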
