I've been fighting fires over the last day where some of our Solr nodes hit
long GC pauses that cause them to lose their ZK connection and have to
reconnect.  That would be annoying, but survivable, although obviously it's
something I want to fix.

But what makes it fatal is the current design of the state update queue.

Every time one of our nodes flaps, it ends up shoving thousands of state
updates and leader requests onto the queue, most of them ultimately
futile.  By the time the state is actually published, it's already stale.
At one point we had 400,000 items in the queue and I just had to declare
bankruptcy, delete the entire queue, and elect a new overseer.  Later, we
had 70,000 items from several flaps that took an hour to churn through,
even after I'd shut down the problematic nodes.  Again, almost entirely
useless, repetitive work.
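
For anyone who wants to poke at their own cluster, here's the kind of thing
I've been using to gauge the backlog: a bare-bones ZooKeeper client that just
counts the children under /overseer/queue.  Treat it as a sketch, not a tool;
the path is the stock Solr layout and the connect string and timeout are
placeholders for your own setup.

    // Rough sketch: report the Overseer state-update queue depth.
    // Assumes the standard /overseer/queue znode and an ensemble at localhost:2181.
    import java.util.List;
    import java.util.concurrent.CountDownLatch;
    import org.apache.zookeeper.ZooKeeper;
    import org.apache.zookeeper.Watcher.Event.KeeperState;

    public class OverseerQueueDepth {
        public static void main(String[] args) throws Exception {
            CountDownLatch connected = new CountDownLatch(1);
            ZooKeeper zk = new ZooKeeper("localhost:2181", 15000, event -> {
                if (event.getState() == KeeperState.SyncConnected) {
                    connected.countDown();
                }
            });
            connected.await();
            // One child znode per pending state update / leader request.
            List<String> items = zk.getChildren("/overseer/queue", false);
            System.out.println("overseer queue depth: " + items.size());
            zk.close();
        }
    }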

After digging through ZkController and related code, the current model just
seems terribly outdated and non-scalable.  If a node flaps for just a moment,
do we really need to laboriously mark every core's state as down, only to
mark it up again moments later?  What purpose does this serve that isn't
already served by the global live_nodes presence indication and/or the
leader election nodes?
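
To make that concrete, here's a toy illustration of the check a consumer of
cluster state can already make with live_nodes alone.  The names and the
factoring are made up; this isn't how any existing class looks, just the
shape of the argument.

    import java.util.Set;

    // Toy sketch of the redundancy argument: whether a node is gone is already
    // answered by /live_nodes membership, so per-core "down" publishes on every
    // flap don't add information.  All names here are hypothetical.
    class ReplicaLiveness {
        static boolean replicaUsable(String nodeName, String publishedState,
                                     Set<String> liveNodes) {
            // If the node has fallen out of /live_nodes, nothing it hosts is
            // reachable, no matter what state was last published for its cores.
            if (!liveNodes.contains(nodeName)) {
                return false;
            }
            // Only when the node is live does the per-core state add information
            // (e.g. "recovering" vs "active").
            return "active".equals(publishedState);
        }
    }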

Rebooting a node creates a similar set of problems: a couple hundred cores
end up generating thousands of ZK operations just to get back to their
normal state.

We're at enough of a breaking point that I *have* to do something here for
our own cluster.  I would love to put my head together with some of the
more knowledgeable Solr operations folks to help redesign something that
could land in master and improve scalability for everyone.  I'd also love
to hear about any prior art or experiments folks have done.  And if there
are already efforts in progress to address this very issue, apologies for
being out of the loop.

Thanks!
Scott
