I would be more in favor of going back to the drawing board on leader
election than making incremental improvements.  Go back to first
principles.  The clarity just isn't there for the code to be maintained.  I
don't trust it.

Coincidentally I sent a message to the Apache Curator users list yesterday
to inquire about leader prioritization:
https://lists.apache.org/thread/lmm30qpm17cjf4b93jxv0rt3bq99c0sb
I suspect the "users" list has too little activity to be useful for the
Curator project; I'm going to try elsewhere.

For shards, there doesn't even need to be a "leader election" recipe
because, unlike the Overseer, there are no shard leader threads that
constantly need to be doing work.  It could be more demand-driven (assign a
leader on demand when one needs to be reassigned), and thus scale better
for many shards as well.
Some of my ideas on this:
https://lists.apache.org/thread/kowcp2ftc132pq0y38g9736m0slchjg7
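
To make that concrete, here's a rough sketch of the direction -- not Solr
code; the class, method names, and znode layout are all made up.  Nobody
campaigns in the background; the first caller that notices there is no
leader tries to claim leadership with an atomic ephemeral create, and
everyone else just reads the result:

    import org.apache.zookeeper.*;

    class OnDemandLeader {
      private final ZooKeeper zk;
      private final String leaderPath; // e.g. a per-shard leader znode (assumed layout)

      OnDemandLeader(ZooKeeper zk, String leaderPath) {
        this.zk = zk;
        this.leaderPath = leaderPath;
      }

      /** Returns the current leader, electing one on demand if none exists. */
      String getOrElectLeader(String myCoreNodeName) throws Exception {
        while (true) {
          try {
            // Fast path: a live (ephemeral) leader entry already exists.
            return new String(zk.getData(leaderPath, false, null));
          } catch (KeeperException.NoNodeException noLeader) {
            try {
              // No leader: claim it atomically.  The ephemeral node goes
              // away with our session, so a dead leader can't linger.
              zk.create(leaderPath, myCoreNodeName.getBytes(),
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
              return myCoreNodeName;
            } catch (KeeperException.NodeExistsException lostRace) {
              // Someone else claimed it between our read and our create;
              // loop and read their entry.
            }
          }
        }
      }
    }

A real version would also have to check replica fitness (only an up-to-date
replica should claim leadership), but the point stands: no election thread
exists until a request actually needs a leader.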

On Mon, Dec 18, 2023 at 11:33 AM Pierre Salagnac <pierre.salag...@gmail.com>
wrote:

> We recently had a couple of issues with production clusters because of race
> conditions in shard leader election. By race condition here, I mean within
> a single node. I'm not discussing how leader election is distributed
> across multiple Solr nodes, but how multiple threads in a single Solr node
> conflict with each other.
>
> Overall, when two threads (on the same server) concurrently join the
> leader election for the same replica, the outcome is unpredictable: it may
> end with two nodes thinking they are the leader, or with no leader at
> all.
> I identified two scenarios, but maybe there are more:
>
> 1. The ZooKeeper session expires while an election is already in progress.
> When we re-create the ZooKeeper session, we re-register all the cores and
> join elections for all of them. If an election is already in progress, or
> one is triggered for any reason, we can have two threads on the same Solr
> node running leader election for the same core.
>
> 2. The REJOINLEADERELECTION command is received twice concurrently for the
> same core.
> This scenario is much easier to reproduce with an external client. It
> occurs for us because we have customizations that use this command.
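>
> To illustrate, a reproduction sketch: sendRejoinLeaderElection() stands in
> for whatever client code issues the REJOINLEADERELECTION command (it's not
> a real API); the point is only having two requests for the same core in
> flight at once:
>
>     import java.util.List;
>     import java.util.concurrent.*;
>
>     class RejoinRace {
>       public static void main(String[] args) throws InterruptedException {
>         ExecutorService pool = Executors.newFixedThreadPool(2);
>         Callable<Void> rejoin = () -> {
>           sendRejoinLeaderElection("myCore"); // same core both times
>           return null;
>         };
>         // Both commands hit the election code for the same core at once.
>         pool.invokeAll(List.of(rejoin, rejoin));
>         pool.shutdown();
>       }
>
>       // Hypothetical stand-in for the client call sending the command.
>       static void sendRejoinLeaderElection(String core) { /* ... */ }
>     }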
>
>
> The code for leader election hasn't changed much in a while, and I don't
> understand the full history behind it. I wonder whether multithreading was
> already discussed and/or taken into account. The code has a "TODO: can we
> even get into this state?" that makes me think this issue was already
> reproduced but not fully solved/understood.
> Since this code makes many calls to ZooKeeper, I don't think we can just
> "synchronize" it with mutual exclusions, as these network calls can be
> incredibly slow when something bad happens. We don't want any thread to be
> blocked by another that is waiting for a remote call to complete.
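>
> For example (just a sketch of one possible direction, not a patch), each
> core could get its own single-threaded executor, so election steps are
> serialized per core while no caller thread ever blocks behind another's
> remote call -- it only enqueues work and returns:
>
>     import java.util.Map;
>     import java.util.concurrent.*;
>
>     class ElectionExecutors {
>       private final Map<String, ExecutorService> perCore = new ConcurrentHashMap<>();
>
>       /** Enqueues an election step for a core; never blocks the caller. */
>       CompletableFuture<Void> submit(String coreName, Runnable electionStep) {
>         ExecutorService exec = perCore.computeIfAbsent(
>             coreName, c -> Executors.newSingleThreadExecutor());
>         return CompletableFuture.runAsync(electionStep, exec);
>       }
>     }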
>
> I would like to get some opinions about making this code more robust to
> concurrency. Unless the main opinion is "no, this code should actually be
> single-threaded!", I can give it a try.
>
> Thanks
>
