At work, I’ve attempted to troubleshoot issues relating to shard leadership. It’s quite possible the root causes relate to customizations in my fork of Solr; who knows. The leadership code/algorithm is so hard to debug/troubleshoot that it’s hard to say. It’s no secret that Solr’s code here is a complicated puzzle [1]. Out of this frustration, I began to ponder a fantasy of how I want leader election to work, informed by my desire to scale to massive numbers of collections & shards on a cluster. Using Curator for elections would perhaps address stability, but not this scale. I’d like to get input from you all on this fantasy. Surely I have overlooked things; please offer your insights!
Thematic concept: Don’t change/elect leaders until it’s actually necessary. In most cases where I work, the leader will return before we truly need a leader. Even when that’s not true, I don’t think doing it lazily should be a noticeable issue. If it is, it’s easy to imagine augmenting this design with an optional eager leader election.

A. Only code paths that truly need a leader do a “leadership check”, which may result in a leader election. This is principally indexing in DistributedZkUpdateProcessor, but there are likely more spots.

B. Leadership check: check whether the shard’s leader is (a) known, (b) state=ACTIVE, (c) on a “live” node, and (d) satisfies the preferredLeader condition. Otherwise, try to elect a leader in a loop until this set of conditions is met or a timeout is reached. (A rough sketch of this check loop is at the end of this email, after my signature.)

B.A: The preferredLeader condition means that either the leader is marked preferredLeader, or no replica marked preferredLeader is eligible for leadership.

C. “Try to elect a leader”: (The word “election” might not be the best word for this algorithm, but whatever.)

C.1: A replica must be eligible to become the leader: it must be live (on a live node) and in ACTIVE state. And, very importantly, eligibility should be governed by ZkShardTerms, which knows which replicas have the most up-to-date data.

C.1.A: Strict use of ZkShardTerms is designed to ensure there is no data loss. That said, “forceLeader” remains in the toolbox of Solr admins (it monkeys with ZkShardTerms to cheat). We may need a new optional mechanism to be closer to what we have today: basically, ignore ZkShardTerms after a configured period of time?

C.1.B: I am assuming replicas will become eligible on their own (e.g. as nodes re-join) rather than this algorithm needing to prod any of them into that state.

C.2: If there are no leader-eligible replicas, complain with information useful for diagnosing why no leader was found. Don’t log this if we already logged the same message in our leadership-check loop. Sleep perhaps 1000ms and try the loop again. If we can wait/monitor on the state of something convenient, do that instead so we don’t sleep longer than necessary.

C.3: Of the leader-eligible replicas, pick any one as the leader (e.g. at random). Prefer preferredLeader=true replicas, of course. ZK will resolve races if this algorithm runs on more than one node.

D. Only track leadership in Slice (i.e. within the cluster state), which is backed by one spot in ZK. Don’t duplicate it in places like CloudDescriptor or other spots in ZK.

Thoughts?

[1] https://lists.apache.org/list?dev@solr.apache.org:2021-10:MILLER%20leader : the “ZkCmdExecutor” thread with Mark Miller, which references https://www.solrdev.io/leader-election-adventure.html (no longer resolves)

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley
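
P.S. To make the leadership-check loop from (B)/(C) a bit more concrete, here is a very rough, non-authoritative Java sketch. It leans on the real Slice and Replica classes, but electLeader(...), the "property.preferredleader" key, and the plain System.err logging are placeholders I made up for illustration; the ZkShardTerms gate from C.1 is only noted in a comment, and a real implementation would re-read fresh cluster state on each pass rather than reuse one Slice object.

import java.util.Comparator;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

import org.apache.solr.common.cloud.Replica;
import org.apache.solr.common.cloud.Slice;

/**
 * Rough sketch of the lazy "leadership check" from (B) and (C).
 * electLeader(...) and the "property.preferredleader" key are placeholders,
 * not existing Solr APIs.
 */
class LazyLeaderCheck {

  /** B: is the current leader usable? (a) known, (b) ACTIVE, (c) live node, (d) preferredLeader ok. */
  static boolean leaderIsUsable(Slice slice, Set<String> liveNodes) {
    Replica leader = slice.getLeader();
    if (leader == null) return false;                              // (a)
    if (leader.getState() != Replica.State.ACTIVE) return false;   // (b)
    if (!liveNodes.contains(leader.getNodeName())) return false;   // (c)
    return preferredLeaderSatisfied(slice, leader, liveNodes);     // (d)
  }

  /** B.A: leader is preferredLeader, or no preferredLeader replica is eligible. */
  static boolean preferredLeaderSatisfied(Slice slice, Replica leader, Set<String> liveNodes) {
    if (leader.getBool("property.preferredleader", false)) return true; // assumed property key
    return slice.getReplicas().stream()
        .filter(r -> r.getBool("property.preferredleader", false))
        .noneMatch(r -> isEligible(r, liveNodes));
  }

  /** C.1: live + ACTIVE; a real impl would also require something like ZkShardTerms#canBecomeLeader. */
  static boolean isEligible(Replica r, Set<String> liveNodes) {
    return r.getState() == Replica.State.ACTIVE && liveNodes.contains(r.getNodeName());
  }

  /** C.2/C.3: loop until a usable leader exists or we time out. */
  static void ensureLeader(Slice slice, Set<String> liveNodes, long timeoutMs) throws InterruptedException {
    long deadlineNanos = System.nanoTime() + timeoutMs * 1_000_000;
    boolean logged = false;
    // NOTE: real code would re-read the Slice and live nodes from fresh cluster state each pass
    while (!leaderIsUsable(slice, liveNodes)) {
      List<Replica> eligible = slice.getReplicas().stream()
          .filter(r -> isEligible(r, liveNodes))
          .collect(Collectors.toList());
      if (eligible.isEmpty()) {
        if (!logged) { // C.2: complain once with diagnostic info, not on every pass
          System.err.println("No leader-eligible replicas for shard " + slice.getName());
          logged = true;
        }
      } else {
        // C.3: prefer preferredLeader=true replicas, otherwise take any; ZK resolves races across nodes
        eligible.sort(Comparator.comparing((Replica r) -> !r.getBool("property.preferredleader", false)));
        electLeader(slice, eligible.get(0));
      }
      if (System.nanoTime() > deadlineNanos) throw new IllegalStateException("Timed out waiting for a leader");
      Thread.sleep(1000); // or wait/monitor on something convenient instead (C.2)
    }
  }

  /** Placeholder: would write the chosen leader into the Slice's one spot in ZK (D). */
  static void electLeader(Slice slice, Replica candidate) {}
}

The point of the sketch is that there is no persistent election queue or per-replica ephemeral election node; the "election" is just a conditional write of the chosen replica into the single leader slot in the cluster state (D), with ZK resolving any race.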