I talked about this type of thing with Jessica at Lucene/Solr Revolution.
One thing is, when a node reconnects to ZK after losing its connection, it
should now efficiently set every core as down in a single command, not one
command per core.
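
Roughly, something like the sketch below: instead of one DOWN message per
core, the node publishes a single "node down" message and the overseer
expands it. This is illustrative Java only; OverseerQueue and the message
keys are stand-ins, not the actual Solr internals.

// Hypothetical sketch, not the real Solr API: enqueue one "node down"
// message that the overseer expands to every core hosted on that node,
// instead of one per-core DOWN message.
import java.util.List;
import java.util.Map;

interface OverseerQueue {            // stand-in for the ZK distributed queue
  void enqueue(Map<String, String> message);
}

class NodeStatePublisher {
  private final OverseerQueue queue;

  NodeStatePublisher(OverseerQueue queue) {
    this.queue = queue;
  }

  // Old behavior: O(number of cores) queue entries on every reconnect.
  void markCoresDownIndividually(String nodeName, List<String> cores) {
    for (String core : cores) {
      queue.enqueue(Map.of(
          "operation", "state",
          "state", "down",
          "node_name", nodeName,
          "core", core));
    }
  }

  // Single command: the overseer marks every core on the node down when
  // it dequeues this one message.
  void markNodeDown(String nodeName) {
    queue.enqueue(Map.of(
        "operation", "downnode",
        "node_name", nodeName));
  }
}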
Beyond that, any single node knows how fast it's sending overseer updates.
Each should have a governor. If the rate is too high, a node should know
it's best to just give up and assume things are screwed. It could try to
reset from ground zero.
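
The governor could be as simple as the sketch below (assumptions: the
window, the threshold, and what "reset from ground zero" actually does are
all illustrative, not anything that exists in Solr today).

// Sketch of a per-node governor: count overseer updates in a sliding
// window; past a threshold, the caller should stop publishing and do a
// full reset instead. Threshold and window values are illustrative.
import java.util.ArrayDeque;
import java.util.Deque;

class OverseerUpdateGovernor {
  private final int maxUpdatesPerWindow;
  private final long windowMillis;
  private final Deque<Long> timestamps = new ArrayDeque<>();

  OverseerUpdateGovernor(int maxUpdatesPerWindow, long windowMillis) {
    this.maxUpdatesPerWindow = maxUpdatesPerWindow;
    this.windowMillis = windowMillis;
  }

  // Returns true if publishing is still allowed; false means the rate is
  // too high and the node should give up and re-register from scratch.
  synchronized boolean tryAcquire() {
    long now = System.currentTimeMillis();
    while (!timestamps.isEmpty() && now - timestamps.peekFirst() > windowMillis) {
      timestamps.pollFirst();          // discard updates outside the window
    }
    if (timestamps.size() >= maxUpdatesPerWindow) {
      return false;
    }
    timestamps.addLast(now);
    return true;
  }
}

A node would wrap every enqueue in tryAcquire(); on false, it stops
replaying stale updates and instead rebuilds its state from scratch.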

There are other things that can be done, but given the current
design, the simplest win is that a replica can easily prevent itself from
spamming the overseer queue.
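
For example (a hypothetical helper, not existing Solr code), a replica
could track the last state it published per core and drop any update that
wouldn't change it:

// Sketch: suppress redundant state updates. If the new state equals the
// last one this replica published, the update is futile by the time the
// overseer sees it, so skip the enqueue entirely.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class ReplicaStateDeduper {
  private final Map<String, String> lastPublished = new ConcurrentHashMap<>();

  // Returns true only if this core's state actually changed since the
  // last publish; put() records the new state either way.
  boolean shouldPublish(String coreName, String newState) {
    String previous = lastPublished.put(coreName, newState);
    return !newState.equals(previous);
  }
}

Repeated publishes of the same state for a core (say, a core publishing
down over and over during recovery retries) would collapse to a single
queue entry.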

Mark
On Wed, Nov 23, 2016 at 5:05 PM Scott Blum <[email protected]> wrote:

> I've been fighting fires the last day where certain of our Solr nodes will
> have long GC pauses that cause them to lose their ZK connection and have
> to reconnect.  That would be annoying, but survivable, although obviously
> it's something I want to fix.
>
> But what makes it fatal is the current design of the state update queue.
>
> Every time one of our nodes flaps, it ends up shoving thousands of state
> updates and leader requests onto the queue, most of them ultimately
> futile.  By the time the state is actually published, it's already stale.
> At one point we had 400,000 items in the queue and I just had to declare
> bankruptcy, delete the entire queue, and elect a new overseer.  Later, we
> had 70,000 items from several flaps that took an hour to churn through,
> even after I'd shut down the problematic nodes.  Again, almost entirely
> useless, repetitive work.
>
> Digging through ZKController and related code, the current model just
> seems terribly outdated and non-scalable now.  If a node flaps for just a
> moment, do we really need to laboriously mark every core's state as down,
> just to mark it up again?  What purpose does this serve that isn't already
> served by the global live_nodes presence indication and/or leader election
> nodes?
>
> Rebooting a node creates a similar set of problems: a couple hundred cores
> end up generating thousands of ZK operations just to get back to normal
> state.
>
> We're at enough of a breaking point that I *have* to do something here for
> our own cluster.  I would love to put my head together with some of the
> more knowledgeable Solr operations folks to help redesign something that
> could land in master and improve scalability for everyone.  I'd also love
> to hear about any prior art or experiments folks have done.  And if there
> are already efforts in progress to address this very issue, apologies for
> being out of the loop.
>
> Thanks!
> Scott
>
> --
- Mark
about.me/markrmiller
