That covers a lot of the current silliness you will see, and pretty simply, since most
of it comes down to removing silly stuff, but you can find some related wildness
in ZkController#register.

// check replica's existence in clusterstate first

zkStateReader.waitForState(collection, 100, TimeUnit.MILLISECONDS,
    (collectionState) -> getReplicaOrNull(collectionState, shardId, coreZkNodeName) != null);

A 100ms wait, no biggie, and at least it uses waitForState, but we
should not need to pull our own clusterstate from ZK here, so we should not
care about waiting for this at all - if there is an item of data we need, it should
have been passed into the core create call.
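
A minimal sketch of that alternative, with made-up names (ReplicaStartData is
not a real Solr class): hand register() the bits it would otherwise wait to
read back out of clusterstate.

// Hypothetical sketch: the caller already knows these values when it asks
// for the core to be created, so pass them along instead of waiting on ZK.
final class ReplicaStartData {
  final String collection;
  final String shardId;
  final String coreNodeName;

  ReplicaStartData(String collection, String shardId, String coreNodeName) {
    this.collection = collection;
    this.shardId = shardId;
    this.coreNodeName = coreNodeName;
  }
}

// register(ReplicaStartData data) can then skip the waitForState call
// entirely, since nothing needs to be confirmed against a clusterstate
// pulled from ZK.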

Next we get the shard terms object so we can later create our shard
terms entry (LIR).

It is slow and bug-inducingly complicated to have each replica do this here,
fighting the others to add an initial entry. You can create the
initial shard terms for a replica when you create or update the
clusterstate (term {replicaname=0}), and you can do it in a single ZK call.
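
A rough sketch of the single-call idea using the plain ZooKeeper multi API;
the paths and the terms JSON layout here are simplified assumptions, not the
exact Solr formats.

import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Op;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

class ShardTermsSketch {
  // First replica of a shard: write the clusterstate update and the initial
  // shard terms node ({"replicaName":0}) in one transaction, instead of
  // having every replica race later to add its own entry.
  static void createReplicaWithTerms(ZooKeeper zk, String collection, String shard,
      String replicaName, byte[] newStateJson, int stateVersion) throws Exception {
    String termsPath = "/collections/" + collection + "/terms/" + shard; // assumed layout
    String statePath = "/collections/" + collection + "/state.json";     // assumed layout
    byte[] terms = ("{\"" + replicaName + "\":0}").getBytes(StandardCharsets.UTF_8);

    // One round trip: both writes land or neither does.
    zk.multi(Arrays.asList(
        Op.setData(statePath, newStateJson, stateVersion),
        Op.create(termsPath, terms, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT)));
  }
}

Later replicas would fold their own {replicaName=0} into the existing terms
node the same way, as part of whatever call updates the clusterstate for them.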


// in this case, we want to wait for the leader as long as the leader might
// wait for a vote, at least - but also long enough that a large cluster has
// time to get its act together
String leaderUrl = getLeader(cloudDesc, leaderVoteWait + 600000);

Now we do getLeader, a polling operation that should not be one, and wait
on it possibly forever. As I mention in the notes on leader sync, there
should be little wait here at most. It's also one of a variety of places
that, even if you remove the polling, sucks to wait on. I'm a fan of
thousands of cores per machine not being an issue. In many of these cases
you can't achieve that with 1000 threads hanging out all over, even if
they are not blind polling.

This is one of the simpler cases where that can be addressed. I break this
method in two and I enhance the ZkStateReader waitForState functionality:
I allow you to pass a runnable to execute when ZkStateReader is notified
and the given predicate matches. So no need for thousands, or hundreds, or
dozens of slackers here. Do a couple of base register items, call
waitForState with a runnable that calls the second part of the logic when
a leader shows up in ZkStateReader, and go away. We can't eat up
threads like this in all these cases.
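
A minimal sketch of what that waitForState-with-a-runnable enhancement could
look like; the names here are mine, not the actual ZkStateReader API.

import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Predicate;

class StateNotifier<S> {
  private static final class Waiter<S> {
    final Predicate<S> predicate;
    final Runnable onMatch;
    Waiter(Predicate<S> predicate, Runnable onMatch) {
      this.predicate = predicate;
      this.onMatch = onMatch;
    }
  }

  private final Set<Waiter<S>> waiters = ConcurrentHashMap.newKeySet();
  private volatile S latest;

  // Instead of blocking a thread until the state matches, register the
  // predicate plus a callback and return immediately.
  void onStateMatch(Predicate<S> predicate, Runnable onMatch) {
    Waiter<S> w = new Waiter<>(predicate, onMatch);
    waiters.add(w);
    S s = latest;
    if (s != null && predicate.test(s) && waiters.remove(w)) {
      onMatch.run(); // already satisfied, fire at most once
    }
  }

  // Called from the ZK notification path when new state arrives.
  void publish(S newState) {
    latest = newState;
    for (Waiter<S> w : waiters) {
      if (w.predicate.test(newState) && waiters.remove(w)) {
        w.onMatch.run();
      }
    }
  }
}

register() then does its cheap base work, hands the "leader is present"
predicate plus the second half of the logic to onStateMatch, and returns,
so a thousand cores don't each pin a thread for leaderVoteWait.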

Now you can also easily shut down and reload cores and do the various
things that are currently harassed by waits like this one, with threads
slacking off in these wait loops.



The rest is just a continuation of this game when it comes to leader
selection and finalization, collection creation, and replica spin-up.
You make ZkStateReader actually efficient. You make multiple and
lazy collections work appropriately, and not super inefficiently.

You make leader election a sensible bit of code. As part of
ZkStateReader sensibility, you remove the need for a billion client-based
watches in ZK, and in many cases the need for a thousand watcher
implementations and instances.

You let the components dictate how often requests go to services and
coalesce dependent code requests (see the sketch below), instead of
letting the dependents dictate service request cadence and size, and you
do a lot less silliness like serializing large JSON structures for
bite-sized data updates.

Then scaling to tens of thousands and even hundreds of thousands of
replicas and collections is doable even on single machines and a handful
of Solr instances, to say nothing of pulling in more hardware. Everything
required is cheap cheap cheap. It's the mountain of unrequired that is
expensive expensive expensive.
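
For the coalescing point, a loose sketch with hypothetical names: callers
enqueue small deltas and a single writer batches them to the service on its
own cadence, rather than each caller writing the whole structure per change.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

class CoalescingWriter<T> {
  private final LinkedBlockingQueue<T> pending = new LinkedBlockingQueue<>();
  private final ScheduledExecutorService scheduler =
      Executors.newSingleThreadScheduledExecutor();

  CoalescingWriter(Consumer<List<T>> flush, long intervalMs) {
    // The writer, not its callers, decides how often the service sees traffic.
    scheduler.scheduleWithFixedDelay(() -> {
      List<T> batch = new ArrayList<>();
      pending.drainTo(batch);
      if (!batch.isEmpty()) {
        flush.accept(batch); // one write covering every queued delta
      }
    }, intervalMs, intervalMs, TimeUnit.MILLISECONDS);
  }

  void submit(T delta) {
    pending.add(delta);
  }
}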


On Fri, Oct 1, 2021 at 12:47 PM Mark Miller <[email protected]> wrote:

> Ignoring lots of polling, inefficiencies, early defensive raw sleeps,
> various races and bugs and a laundry list of items involved in making
> leader processes good enough to enter a collection creation contest, here
> is a more practical small set of notes off the top of my head on a quick
> inspection around what is currently just in your face non sensible.
>
> https://gist.github.com/markrmiller/233119ba84ce39d39960de0f35e79fc9
>


-- 
- Mark

http://about.me/markrmiller
