Thanks, Mark, for your write-ups! This is an area of SolrCloud I'm currently
actively exploring at work (I might publish my notes as well at some point).

I think the terms value (fitness to become leader) should participate in
the election node ordering, along with a terms goal (based on the highest
known term for the shard) and clear options for choosing between a stalled
election and data loss. If data loss is not acceptable, an external
process, automated or human, likely has to intervene to unblock the
stalled election.
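
To make the ordering idea a bit more concrete, here is a rough sketch in
Java. Everything in it is hypothetical: TermAwareElection, Candidate,
Outcome and the dataLossAcceptable flag are names made up for
illustration, not existing Solr classes or settings, and the election
sequence number only stands in for the order in which replicas joined.

```java
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch: none of these names exist in Solr.
final class TermAwareElection {

    enum Outcome { ELECT, STALL, ELECT_WITH_DATA_LOSS }

    static final class Candidate {
        final String replica;
        final long term;        // fitness to become leader
        final int electionSeq;  // join order (assumed, like a sequence number)

        Candidate(String replica, long term, int electionSeq) {
            this.replica = replica;
            this.term = term;
            this.electionSeq = electionSeq;
        }
    }

    // Highest term first; ties are broken by the earliest join order.
    static final Comparator<Candidate> ORDER =
        Comparator.comparingLong((Candidate c) -> c.term).reversed()
                  .thenComparingInt(c -> c.electionSeq);

    // Best positioned candidate; throws NoSuchElementException if empty.
    static Candidate best(List<Candidate> candidates) {
        return Collections.min(candidates, ORDER);
    }

    // termGoal is the highest known term for the shard.
    static Outcome decide(Candidate best, long termGoal,
                          boolean dataLossAcceptable) {
        if (best.term >= termGoal) {
            return Outcome.ELECT;
        }
        // The best candidate is behind the goal: either accept the loss
        // or stall until something external unblocks the election.
        return dataLossAcceptable ? Outcome.ELECT_WITH_DATA_LOSS
                                  : Outcome.STALL;
    }
}
```

With a goal of 5 and candidates at terms 3 and 4, best() picks the
term-4 replica and decide() returns STALL unless data loss was declared
acceptable.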

Even if all updates hit multiple replicas, nothing guarantees that any of
these copies is present when another replica (without the update) starts.
If we don't want to wait at startup for other replicas to join an election
(this can't scale, even though CoreSorter does its best... but it is the
most convoluted Comparator I've ever seen), we might need the notion of an
"incomplete leader", i.e. a replica that is the currently elected leader
but does not have all the data. At some later point we might decide to
accept the loss and consider it the leader or, when a better positioned
replica joins, have that replica become leader. This will require
revisiting quite a few assumptions, so it should likely be associated with
a thorough cleanup (and a move to Curator election?).
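
As a very rough illustration of the "incomplete leader" notion, the
states could look something like the sketch below. This is only a toy
model under my own assumptions: IncompleteLeaderSketch, its State enum
and the join/acceptLoss methods are invented names, not anything that
exists in Solr or Curator, and a real implementation would have to deal
with ZooKeeper watches, races and replica failures.

```java
// Hypothetical toy model: none of these names exist in Solr.
final class IncompleteLeaderSketch {

    enum State { NO_LEADER, INCOMPLETE_LEADER, LEADER }

    private String leader;     // current elected leader, or null
    private long leaderTerm;   // term of the current leader
    private long termGoal;     // highest known term for the shard

    IncompleteLeaderSketch(long termGoal) {
        this.termGoal = termGoal;
    }

    // A leader below the goal is elected but known to be missing data.
    State state() {
        if (leader == null) {
            return State.NO_LEADER;
        }
        return leaderTerm >= termGoal ? State.LEADER
                                      : State.INCOMPLETE_LEADER;
    }

    // A replica joins without anyone waiting for the others. It takes
    // over only if there is no leader yet or it is better positioned.
    void join(String replica, long term) {
        if (leader == null || term > leaderTerm) {
            leader = replica;
            leaderTerm = term;
        }
    }

    // Explicit decision to accept the loss: the goal is lowered to what
    // the current leader has, so it becomes a regular leader.
    void acceptLoss() {
        if (leader != null) {
            termGoal = leaderTerm;
        }
    }

    String leader() {
        return leader;
    }
}
```

With a goal of 5, a replica joining at term 3 becomes an incomplete
leader; a later replica at term 5 replaces it as a regular leader, or
alternatively acceptLoss() turns the term-3 replica into the leader.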

Ilan



On Sun, Oct 3, 2021 at 4:27 AM Mark Miller <[email protected]> wrote:

> I filed
> https://issues.apache.org/jira/browse/SOLR-15672 Leader Election is
> flawed - for future reference if anyone looks at tackling leader election
> issues. I’ll drop a couple notes and random suggestions there
>
> Mark
>
> On Sat, Oct 2, 2021 at 12:47 PM Mark Miller <[email protected]> wrote:
>
>> At some point digging through some of this stuff, I often start to think,
>> I wonder how good our tests are at catching certain categories of problems.
>> Various groups of code branches and behaviors.
>>
>> I do notice that as I get the tests flying, they do start to pick up a lot
>> more issues. A lot more bugs and bad behavior. And as they start to near
>> max out, I start feeling a little better about a lot of it. But then I’m
>> looking at things outside of tests still as well. Using my own tools and
>> setups, using stuff from others. Being cruel in my expectations. And by
>> then I’ve come a long way, but I can still find problems. Run into bad
>> situations. If I push, and when I make it so I can push harder, I push even
>> harder. And I want the damn thing solid. Why come all this way if I can’t
>> have really and truly solid. And that’s when I reach for collection
>> creation mixed with cluster restarts.  How about I shove 100000 SolrCores
>> 10000 collections right down its mouth on a handful of instances on a
>> single machine in like a minute timeframe. How about 30 seconds. How about
>> more collections. How about lower time frames. Vary things around. Let’s
>> just swamp it and demand the setup eats it in silly time frames and stands
>> up at the end correct and happy.  And then I start to get to the bottom of
>> the barrel on what’s subverting my solidness. But as I’ve always said, more
>> and more targeted tests along with simpler and more understandable
>> implementations will also cover a lot more ground. I certainly have pushed
>> on simpler implementations. I’ve never gotten to the point where I have the
>> energy and time to just push on more, better and more targeted tests, more
>> unit tests, more mockito, more awaitability as Tims suggested, etc.
>> --
>> - Mark
>>
>> http://about.me/markrmiller
>>
> --
> - Mark
>
> http://about.me/markrmiller
>
