On Fri, Oct 14, 2022 at 6:11 AM Mark Miller <markrmil...@gmail.com> wrote:

> I don’t have much to say about the proposal, other than to say that if an
> election ever ends up involving syncing up and exchanging data, doing that
> just in time is probably less than ideal for most of the more common use
> cases.
>

I should emphasize that replicas continue to want to sync up with their
leaders on their own -- eagerly.  See RecoveringCoreTermWatcher.  I could
imagine an option to make this lazy as well, but I'm not proposing that
now.


> That’s just an aside though. I’d be more interested in seeing the proposal
> connect problems with solutions.


The #1 point of the proposal is robustness/stability, not a fix for a
specific bug.  Instead of leaving us wondering why there isn't a leader,
this proposal would log the pertinent state and elect a leader if one is
eligible, so there is no guessing about why one wasn't elected.  The logged
information may raise other questions, of course, but you're then closer to
debugging what's happening.  This is more user/operator friendly.


> My quick read makes me think the goal is
> some dimension of scale (I’m guessing a lazy dimension, usually not the most
> common Solr architecture in my experience fwiw). But I don’t see what the
> problems are for that dimension of scale or how to connect proposals to
> solutions to the problems. Unless I’m just missing it.
>

I think the proposal somewhat indirectly addresses a dimension of scale
involving tens of thousands of shards (and 2x as many replicas) across a
large SolrCloud cluster, using a simple node restart as an example.
Assuming replicas continue to try to stay in sync based on ZkShardTerms
(sorry I wasn't clear on that), the actual choice of who is leader need not
happen eagerly; doing so is premature, I'd say.

BTW, the proposal's strategy is complementary to additional leader
election algorithms being in place, though I don't think we need both.  The
most important mechanism is ZkShardTerms, which is already in place to
govern leader eligibility; it's brilliant.  With that complicated problem
already solved, we're left with simply picking an eligible replica and
recording it in ZK -- so just pick one already and be done with it :-)
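
To make "log the state, then just pick one" concrete, here is a minimal
sketch.  It is a simplified, hypothetical model and not Solr's actual
ZkShardTerms API: the class and method names are made up for illustration,
and it assumes a replica is eligible when it is live, not marked as
recovering, and holds the highest term registered for the shard.  Recording
the choice in ZK is left out.

```java
import java.util.Map;
import java.util.Optional;
import java.util.Set;
import java.util.TreeMap;

// Simplified, hypothetical model for illustration; NOT Solr's real ZkShardTerms API.
final class LeaderPickSketch {

    // Assumed rule: eligible = live, not recovering, and holding the highest term.
    static boolean isEligible(String replica, Map<String, Long> terms,
                              Set<String> live, Set<String> recovering) {
        long maxTerm = terms.values().stream().mapToLong(Long::longValue).max().orElse(0L);
        Long term = terms.get(replica);
        return term != null && term == maxTerm
                && live.contains(replica) && !recovering.contains(replica);
    }

    // Pick the first eligible replica; sorted iteration keeps the choice deterministic.
    static Optional<String> pickLeader(Map<String, Long> terms,
                                       Set<String> live, Set<String> recovering) {
        for (String replica : new TreeMap<>(terms).keySet()) {
            if (isEligible(replica, terms, live, recovering)) {
                return Optional.of(replica);
            }
        }
        return Optional.empty();
    }

    // One line per replica, so an operator can see why a leader was or wasn't picked.
    static String describe(Map<String, Long> terms,
                           Set<String> live, Set<String> recovering) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Long> e : new TreeMap<>(terms).entrySet()) {
            String r = e.getKey();
            sb.append(r).append(" term=").append(e.getValue())
              .append(" live=").append(live.contains(r))
              .append(" recovering=").append(recovering.contains(r))
              .append(" eligible=").append(isEligible(r, terms, live, recovering))
              .append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, Long> terms = Map.of("core_node1", 5L, "core_node2", 5L, "core_node3", 4L);
        Set<String> live = Set.of("core_node2", "core_node3"); // core_node1 is down
        Set<String> recovering = Set.of();
        System.out.print(describe(terms, live, recovering));
        System.out.println("picked: "
                + pickLeader(terms, live, recovering).orElse("<none eligible>"));
    }
}
```

The point of the sketch is that the hard part (who is eligible) is decided
by the terms, so the pick itself can be trivial, and the same state used for
the pick is what gets logged when nobody qualifies.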
