[jira] [Comment Edited] (SOLR-7065) Let a replica become the leader regardless of it's last published state if all replicas participate in the election process.

Erick Erickson (JIRA) Fri, 30 Jan 2015 10:54:05 -0800

    [ 
https://issues.apache.org/jira/browse/SOLR-7065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14299026#comment-14299026
 ]


Erick Erickson edited comment on SOLR-7065 at 1/30/15 6:52 PM:
---------------------------------------------------------------

Yeah, I started to take a whack at it at one point, basically taking control of 
the ordering of the election queue but abandoned it due to time constraints. 
One problem is that we're bastardizing the whole ephemeral election process in 
ZK and resorting to the "tie breaker" code that does things like "find the next 
guy and jump down two, unless you're within the first two of the head in which 
case do nothing". And the sorting is sensitive to the session ID to boot. 

The TestRebalanceLeaders code exercises the shard leader election; we can see 
if we can extend it. I'm not sure how robust it is when nodes are flaky.

You mentioned at one point that you wondered whether the whole "watch the guy 
in front" approach with ZK's ephemeral-sequential nodes was the right way to 
go. The hack I started still used that mechanism and just took better control 
of how nodes were inserted into the leader election queue, so I don't think 
that approach really addresses why this has spun out of control.

I really wonder if we should change the mechanism. It seems to me that the 
fundamental fragility (apart from how hard the code is to understand) is that 
if the sequence of who watches which ephemeral node somehow gets out of whack, 
there is no mechanism for letting the _other_ nodes in the queue know that 
there's a problem that needs to be sorted out. I assume that can result in 
there being no leader at all. It certainly happened to me often enough.

I wonder if tying leader election into ZK state changes rather than watching 
the ephemeral election node-in-front is a better way?

This has _not_ been thought out, but what about something like:

Solr gets a notification of state change from ZK and drops into the "should I 
be leader" code which gets significantly less complex.
  -1> If I'm not active, ??? Probably just return assuming the next state 
change will re-trigger this code.
  0> If I'm not in the election queue, put myself at the tail. (handles 
mysterious out-of-whack situations)
  1> If there is a leader and it's active, return. (if it's in the middle of 
going down, we should get another state change when it's down, right?)
  2a> If some other node is both active and the preferred leader, return (again 
depending on a state change message to get back to this code if that node goes 
down).
  2b> If I'm the preferred leader, take over leadership.
  3> If any other node in the leader election queue in front of me is active, 
return (state change gets us back here if those nodes are going down).
  4> Take over leadership.
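
The steps above could be sketched roughly as follows. This is only an 
illustration, not Solr code: the class, the Replica fields, and the 
shouldTakeLeadership method are hypothetical names, and it assumes each node 
can see the state of every replica in its shard plus the election queue 
(head first) at the moment the state-change notification arrives.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the "should I be leader" check; not Solr's API.
final class LeaderDecisionSketch {

    /** One replica as seen in the cluster state (illustrative fields only). */
    static final class Replica {
        final String name;
        final boolean active;
        final boolean leader;
        final boolean preferredLeader;

        Replica(String name, boolean active, boolean leader, boolean preferredLeader) {
            this.name = name;
            this.active = active;
            this.leader = leader;
            this.preferredLeader = preferredLeader;
        }
    }

    /**
     * Runs on every state-change notification. Returns true only when this
     * node should take over leadership; false means "return and wait for the
     * next state change". The queue is mutable so step 0 can append to it.
     */
    static boolean shouldTakeLeadership(String myName,
                                        Map<String, Replica> replicas,
                                        List<String> queue) {
        Replica me = replicas.get(myName);
        // -1> Not active: return, the next state change re-triggers this.
        if (me == null || !me.active) return false;
        // 0> Not in the election queue: put myself at the tail.
        if (!queue.contains(myName)) queue.add(myName);
        // 1> There is a leader and it is active.
        for (Replica r : replicas.values()) {
            if (r.leader && r.active) return false;
        }
        // 2a> Some other node is both active and the preferred leader.
        for (Replica r : replicas.values()) {
            if (!r.name.equals(myName) && r.active && r.preferredLeader) return false;
        }
        // 2b> I am the preferred leader.
        if (me.preferredLeader) return true;
        // 3> Any node in front of me in the queue is active.
        for (String name : queue) {
            if (name.equals(myName)) break;
            Replica ahead = replicas.get(name);
            if (ahead != null && ahead.active) return false;
        }
        // 4> Nobody with a better claim is active.
        return true;
    }

    public static void main(String[] args) {
        Map<String, Replica> replicas = new HashMap<>();
        replicas.put("r1", new Replica("r1", false, true, false)); // old leader, now down
        replicas.put("r2", new Replica("r2", true, false, false));
        replicas.put("r3", new Replica("r3", true, false, false));
        List<String> queue = new ArrayList<>(Arrays.asList("r1", "r2", "r3"));
        System.out.println("r2 takes over: " + shouldTakeLeadership("r2", replicas, queue));
        System.out.println("r3 takes over: " + shouldTakeLeadership("r3", replicas, queue));
    }
}
```

In this example r2 takes over and r3 does not, because r2 is active and sits 
in front of r3 in the queue. Every node runs the same check against the same 
state, so the outcome only stays consistent if the state each node sees is 
current.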

Since this operates off of state changes to ZK, it seems like it gives us the 
chance to recover from weird situations. I don't _think_ it increases traffic. 
Don't all ZK state changes have to go to all nodes anyway?

I'm not sure in this case whether we even need a leader election queue at all. 
Is the clusterstate any less robust than the election queue? Even if it would 
be just as good, not sure how you'd express "the node in front". Actually, a 
simple counter property in the state for each replica would do it maybe. You'd 
set it at one more than any other node in the collection when a node changed 
its state to "active". I'll freely admit though, you've seen a lot more in the 
weeds here than I have so I'll defer to your experience.
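
A minimal sketch of the counter idea, to make "the node in front" concrete. 
All names here (ElectionCounterSketch, markActive, markInactive, isInFront) 
are hypothetical, and the in-memory map merely stands in for a per-replica 
property in the cluster state. It assumes updates are applied one at a time; 
how that would be guaranteed in a real cluster is not addressed.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a per-replica counter property in the state.
final class ElectionCounterSketch {

    // Replica name -> counter assigned when it last became active.
    private final Map<String, Long> counters = new HashMap<>();

    /** Replica changed its state to "active": one more than any other node. */
    long markActive(String replica) {
        long max = 0;
        for (long c : counters.values()) {
            max = Math.max(max, c);
        }
        long next = max + 1;
        counters.put(replica, next);
        return next;
    }

    /** Replica is no longer active: it gives up its place in line. */
    void markInactive(String replica) {
        counters.remove(replica);
    }

    /** True if 'other' currently holds a lower counter than 'me'. */
    boolean isInFront(String other, String me) {
        Long o = counters.get(other);
        Long m = counters.get(me);
        return o != null && m != null && o < m;
    }

    public static void main(String[] args) {
        ElectionCounterSketch state = new ElectionCounterSketch();
        state.markActive("r1");
        state.markActive("r2");
        System.out.println("r1 in front of r2: " + state.isInFront("r1", "r2"));
    }
}
```

A replica that goes down and comes back gets a fresh, higher counter, so it 
lands behind everything that stayed active, much like rejoining the tail of 
the election queue.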

Anyway, let's kick the tires of what's to be done; maybe we can tag-team this. 
I consider the above just a jumping-off point to tame this beast. Be glad to 
chat if you or anyone else wants to kick it around...

One thing I'm not real clear on is how up-to-date the ZK cluster state is. 
Since changing the state is done through the Overseer, how do we ensure that 
the state is current when making decisions?


> Let a replica become the leader regardless of it's last published state if 
> all replicas participate in the election process.
> ----------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-7065
>                 URL: https://issues.apache.org/jira/browse/SOLR-7065
>             Project: Solr
>          Issue Type: Improvement
>            Reporter: Mark Miller
>            Assignee: Mark Miller
>         Attachments: SOLR-7065.patch, SOLR-7065.patch
>
>



