[
https://issues.apache.org/jira/browse/SOLR-15672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17479705#comment-17479705
]
Mark Robert Miller commented on SOLR-15672:
-------------------------------------------
That was likely an early motivation/thought while prototyping. It really ended
up kind of stuck there due to its value in very simply allowing one to count on
zk to ensure there is only one leader: every other potential leader wannabe or
bug is forced to reckon with zk's simplest recipe - the "dumb" but, for smaller
numbers, fantastic distributed lock. You make the node, you get the lock; else
you have to retry or knowingly cheat and delete.
You can of course go about that in other ways, but in the face of the
surrounding landscape, that's the kind of thing that let zk form some sort of
backstop.
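The "make the node, get the lock" recipe described above can be sketched with a toy in-memory simulation (plain Python, not real ZooKeeper API calls; `FakeElectionZnode` and the node/session names are hypothetical): each contender creates a sequential ephemeral child under an election path, the lowest sequence number holds the lock, and the lock passes automatically when the holder's session dies.

```python
class FakeElectionZnode:
    """Toy simulation of one ZK election parent znode with sequential,
    ephemeral children. Illustrative only - not ZooKeeper's actual API."""

    def __init__(self):
        self.next_seq = 0
        self.children = {}  # znode name -> owning session id

    def create_sequential(self, session_id):
        # ZK appends a monotonically increasing, zero-padded sequence
        # number to sequential znode names; we imitate that here.
        name = f"n_{self.next_seq:010d}"
        self.next_seq += 1
        self.children[name] = session_id
        return name

    def leader(self):
        # The contender holding the lowest sequence number owns the lock.
        return self.children[min(self.children)] if self.children else None

    def session_expired(self, session_id):
        # Ephemeral znodes vanish when their session dies, so the next
        # lowest sequence number becomes the leader with no extra step.
        self.children = {n: s for n, s in self.children.items()
                         if s != session_id}

elect = FakeElectionZnode()
elect.create_sequential("node-a")
elect.create_sequential("node-b")
assert elect.leader() == "node-a"  # first creator holds the lock
elect.session_expired("node-a")
assert elect.leader() == "node-b"  # lock passes on session loss
```

Any "wannabe" leader that tries to cheat has to explicitly delete the holder's node, which is exactly the reckoning the comment describes.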
Would you want to put it in the cluster state? How complicated would that be to
achieve the same role of being the “fail safe” potential leader election (only
entered after thinking you’d win the standard zk election and then passed a
shard sync check)? I dunno.
I looked at making changes to it myself. Personally, I would not have brought
it into the state.json, I had separated collection structure from state
(state.json not a great name for the structure, but …) and so I looked at
different things. I really didn’t like that you had to read a full znode for
such simple data.
I had pushed most consumption from zk into the zkstatereader. It had one
server-side, recursive watcher and just got the state change events streamed to
it as they happened, distributing them out via publish/subscribe style
callbacks. So znode name changes were ridiculously cheap: simple string watcher
events streaming down a persistent connection. As elections happened or leaders
registered, the zkstatereader just watched it go by. Except the actual leader,
in this case, is in the znode's data, not the znode's name. So I was trying to
get the byte array out of there and just have the leader indicated by the znode
name. It ended up being pretty tricky though, and I pulled back, because even
with a hugely more stable and vetted leader election process, I could find
difficult-to-test and hard-to-address corner cases where the current
"distributed lock" properties stood in the way of a data loss mistake. A full
or partial cluster restart during a lot of activity where an overseer changed
was the scariest and toughest I'd hit, but there was certainly scope for more,
as long as the proper scale and scenarios matched with the right code to spot
and verify the right things.
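The name-vs-data distinction above is the crux: if the leader's identity is embedded in the znode *name*, the watch event string alone answers the question; if it lives in the znode's *data*, every consumer must follow the event with an extra read. A minimal sketch (hypothetical paths and helper names, not Solr's real znode layout):

```python
def leader_from_event(event_path):
    # Leader identity recovered from the watch event itself: the znode
    # name carries the payload, so no extra data read is needed.
    # e.g. ".../leaders/shard1/leader_core_node_3" -> "core_node_3"
    node_name = event_path.rsplit("/", 1)[-1]
    return node_name.removeprefix("leader_")

def leader_from_data(read_znode_data, event_path):
    # Leader identity stored in the znode's byte array: the watch event
    # only says "something changed"; an explicit read must follow.
    return read_znode_data(event_path).decode("utf-8")

# A fake byte-array store standing in for getData() against a live zk.
fake_store = {"/collections/c1/leaders/shard1": b"core_node_3"}

assert leader_from_event(
    "/collections/c1/leaders/shard1/leader_core_node_3") == "core_node_3"
assert leader_from_data(
    fake_store.__getitem__, "/collections/c1/leaders/shard1") == "core_node_3"
```

The first form keeps the recursive watcher's event stream self-describing; the second is the per-leader-node "peek" the comment laments below.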
So yeah, redundant info, things I don't like about it. I can't say any
direction would be a good or a bad one; it depends on a whole lot either way. I
did end up keeping redundancy myself - I kept the leader and replica states in
the "structure" json file rather than forcing humans and every consumer to
reconstruct the full view themselves or go through the right intermediary. But
I only notified consumers to take a look on structure changes, not every time
it was updated, and it wasn't updated based on client activity - the overseer
chose. For code needing the live stream, the zkstatereader supplied that by
keeping the structure from the last structure change it was told to fetch, plus
the stream of state changes it saw from its recursive watcher (except it did
still take a peek at leader nodes explicitly to read the byte array containing
the winner's name :().
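The snapshot-plus-stream arrangement described above can be sketched as follows (class and method names are illustrative, not Solr's actual ZkStateReader API): the reader holds the last full structure it was told to fetch and overlays the live per-replica state events from its recursive watcher.

```python
class StateView:
    """Toy sketch: last full structure snapshot + live event overlay."""

    def __init__(self, structure_snapshot):
        self.structure = dict(structure_snapshot)  # last full fetch
        self.live_states = {}                      # overlay from watch events

    def on_structure_change(self, new_snapshot):
        # Overseer-driven: consumers are only told to re-fetch on
        # structure changes, not on every state update.
        self.structure = dict(new_snapshot)
        self.live_states.clear()

    def on_state_event(self, replica, state):
        # Streamed from the recursive watcher as changes happen.
        self.live_states[replica] = state

    def current(self):
        # Full view = snapshot with the live stream applied on top,
        # so no consumer has to reconstruct it from scratch.
        merged = dict(self.structure)
        merged.update(self.live_states)
        return merged

view = StateView({"replica1": "active", "replica2": "down"})
view.on_state_event("replica2", "recovering")
assert view.current() == {"replica1": "active", "replica2": "recovering"}
```

Clearing the overlay on a structure change keeps the snapshot authoritative; the stream only ever fills the gap since the last fetch.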
> Leader Election is flawed.
> ---------------------------
>
> Key: SOLR-15672
> URL: https://issues.apache.org/jira/browse/SOLR-15672
> Project: Solr
> Issue Type: Bug
> Reporter: Mark Robert Miller
> Priority: Major
>
> Filing this not as a work item I’m assigning to myself, but to note an open
> issue where some notes can accumulate.
--
This message was sent by Atlassian Jira
(v8.20.1#820001)