[
https://issues.apache.org/jira/browse/SOLR-7065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14299026#comment-14299026
]
Erick Erickson edited comment on SOLR-7065 at 1/30/15 6:52 PM:
---------------------------------------------------------------
Yeah, I started to take a whack at it at one point, basically taking control of
the ordering of the election queue but abandoned it due to time constraints.
One problem is that we're bastardizing the whole ephemeral election process in
ZK and resorting to the "tie breaker" code that does things like "find the next
guy and jump down two, unless you're within the first two of the head in which
case do nothing". And the sorting is sensitive to the session ID to boot.
The TestRebalanceLeaders code exercises the shard leader election, we can see
if we can extend it. I'm not sure how robust it is when nodes are flaky.
You mentioned at one point that you wondered whether the whole "watch the guy
in front" and ZKs ephemeral-sequential node was the right way to approach this.
The hack I started still used that mechanism, just took better control of how
nodes were inserted into the leader election queue so I don't think that
approach really addresses why this has spun out of control.
I really wonder if we should change the mechanism. It seems to me that the
fundamental fragility (apart from how hard the code is to understand) is that
if the sequence of who watches which ephemeral node somehow gets out of whack,
there is no mechanism for letting the _other_ nodes in the queue know that
there's a problem that needs to be sorted out which can result in no leaders I
assume. Certainly happened often enough to me.
I wonder if tying leader election into ZK state changes rather than watching
the ephemeral election node-in-front is a better way?
This has _not_ been thought out, but what about something like:
Solr gets a notification of state change from ZK and drops into the "should I
be leader" code which gets significantly less complex.
-1> If I'm not active, ??? Probably just return assuming the next state
change will re-trigger this code.
0> If I'm not in the election queue, put myself at the tail. (handles
mysterious out-of-whack situations)
1> If there is a leader and it's active, return. (if it's in the middle of
going down, we should get another state change when it's down, right?)
2a> If some other node is both active and the preferred leader return (again
depending on a state change message if that node goes down to get back to this
code)
2b> If I'm the preferred leader, take over leadership.
3> If any other node in the leader election queue in front of me is active,
return (state change gets us back here if those nodes are going down).
4> take over leadership.
Since this operates off of state changes to ZK, it seems like it gives us the
chance to recover from weird situations. I don't _think_ it increases traffic,
don't all ZK state changes have to go to all nodes anyway?
I'm not sure in this case whether we even need a leader election queue at all.
Is the clusterstate any less robust than the election queue? Even if it would
be just as good, not sure how you'd express "the node in front". Actually, a
simple counter property in the state for each replica would do it maybe. You'd
set it at one more than any other node in the collection when a node changed
its state to "active". I'll freely admit though, you've seen a lot more in the
weeds here than I have so I'll defer to your experience.
Anyway, let's kick the tires of what's to be done, maybe we can tag-team this.
I consider the above just a jumping-off point to tame this beast. Be glad to
chat if you or anyone else wants to kick it around...
One thing I'm not real clear on is how up-to-date the ZK cluster state is.
Since changing the state is done through the Overseer, how to insure that the
state is current when making decisions?
was (Author: erickerickson):
Yeah, I started to take a whack at it at one point, basically taking control of
the ordering of the election queue but abandoned it due to time constraints.
One problem is that we're bastardizing the whole ephemeral election process in
ZK and resorting to the "tie breaker" code that does things like "find the next
guy and jump down two, unless you're within the first two of the head in which
case do nothing". And the sorting is sensitive to the session ID to boot.
The TestRebalanceLeaders code exercises the shard leader election, we can see
if we can extend it. I'm not sure how robust it is when nodes are flaky.
You mentioned at one point that you wondered whether the whole "watch the guy
in front" and ZKs ephemeral-sequential node was the right way to approach this.
The hack I started still used that mechanism, just took better control of how
nodes were inserted into the leader election queue so I don't think that
approach really addresses why this has spun out of control.
I really wonder if we should change the mechanism. It seems to me that the
fundamental fragility (apart from how hard the code is to understand) is that
if the sequence of who watches which ephemeral node somehow gets out of whack,
there is no mechanism for letting the _other_ nodes in the queue know that
there's a problem that needs to be sorted out which can result in no leaders I
assume. Certainly happened often enough to me.
I wonder if tying leader election into ZK state changes rather than watching
the ephemeral election node-in-front is a better way?
This has _not_ been thought out, but what about something like:
Solr gets a notification of state change from ZK and drops into the "should I
be leader" code which gets significantly less complex.
-1> If I'm not active, ??? Probably just return assuming the next state
change will re-trigger this code.
0> If I'm not in the election queue, put myself at the tail. (handles
mysterious out-of-whack situations)
1> If there is a leader and it's active, return. (if it's in the middle of
going down, we should get another state change when it's down, right?)
2a> If some other node is both active and the preferred leader return (again
depending on a state change message if that node goes down to get back to this
code)
2b> If I'm the preferred leader, take over leadership.
3> If any other node in the leader election queue in front of me is active,
return (state change gets us back here if those nodes are going down).
4> take over leadership.
Since this operates off of state changes to ZK, it seems like it gives us the
chance to recover from weird situations. I don't _think_ it increases traffic,
don't all ZK state changes have to go to all nodes anyway?
I'm not sure in this case whether we even need a leader election queue at all.
Is the clusterstate any less robust than the election queue? Even if it would
be just as good, not sure how you'd express "the node in front". Actually, a
simple counter property in the state for each replica would do it maybe. You'd
set it at one more than any other node in the collection when a node changed
its state to "active". I'll freely admit though, you've seen a lot more in the
weeds here than I have so I'll defer to your experience.
Anyway, let's kick the tires of what's to be done, maybe we can tag-team this.
I consider the above just a jumping-off point to tame this beast. Be glad to
chat if you or anyone else wants to kick it around...
> Let a replica become the leader regardless of it's last published state if
> all replicas participate in the election process.
> ----------------------------------------------------------------------------------------------------------------------------
>
> Key: SOLR-7065
> URL: https://issues.apache.org/jira/browse/SOLR-7065
> Project: Solr
> Issue Type: Improvement
> Reporter: Mark Miller
> Assignee: Mark Miller
> Attachments: SOLR-7065.patch, SOLR-7065.patch
>
>
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]