Re: I think rejoin leader elections at the head isn't doing what it should

Jessica Mallet Tue, 02 Dec 2014 17:35:44 -0800

This is reminiscent of my conversation with Noble on SOLR-6095,
starting at this comment:
https://issues.apache.org/jira/browse/SOLR-6095?focusedCommentId=14032386&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14032386


Unfortunately I dropped off following it and my memory is a bit vague right
now. Reading the comments, I think Noble had in mind that the
tie-breaker can pick the wrong node (n2) to be the leader, but that the
wrong node will then re-initiate the process to renounce leadership and
re-join (according to
https://issues.apache.org/jira/browse/SOLR-6095?focusedCommentId=14032619&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14032619
).

I then asked about when that renounce process will happen for n2 (
https://issues.apache.org/jira/browse/SOLR-6095?focusedCommentId=14032659&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14032659),
and I'm not sure that was ever specifically answered. Figuring out if and
how that happens might be key to moving forward?
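
To make the tie-break concrete, here is a minimal sketch of the ordering
Erick describes in the quoted message below. It is not Solr's actual code.
It assumes election node names of the hypothetical form
"<sessionId>-<nodeName>-n_<sequence>", sorts by sequence number, and falls
back to plain string comparison on a tie. The class name, method names, and
sample node names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ElectionSortSketch {

    // Assumed layout: "<sessionId>-<nodeName>-n_<sequence>" (hypothetical).
    static int seqOf(String electionNode) {
        int idx = electionNode.lastIndexOf("-n_");
        return Integer.parseInt(electionNode.substring(idx + 3));
    }

    // Sort by sequence number; on a tie, fall back to comparing the whole
    // name, so the leading session ID decides the order, not the join order.
    static void sortBySeq(List<String> nodes) {
        nodes.sort((a, b) -> {
            int bySeq = Integer.compare(seqOf(a), seqOf(b));
            return bySeq != 0 ? bySeq : a.compareTo(b);
        });
    }

    public static void main(String[] args) {
        // n2 was already second in line; n3 re-joined "at the head" and so
        // shares n2's sequence number. The old leader n1 is already gone.
        List<String> n3HasLowerSession = new ArrayList<>(Arrays.asList(
                "200-core_node2-n_0000000001",
                "100-core_node3-n_0000000001"));
        sortBySeq(n3HasLowerSession);
        System.out.println("head: " + n3HasLowerSession.get(0));

        // Same queue, session IDs swapped: now n2 sorts first instead.
        List<String> n2HasLowerSession = new ArrayList<>(Arrays.asList(
                "100-core_node2-n_0000000001",
                "200-core_node3-n_0000000001"));
        sortBySeq(n2HasLowerSession);
        System.out.println("head: " + n2HasLowerSession.get(0));
    }
}
```

Under these assumptions, which of n2 and n3 ends up at the head depends only
on the session ID prefix, which would explain why the test harness sees
different winners on different runs.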

Jessica

On Tue, Dec 2, 2014 at 4:30 PM, Erick Erickson <erickerick...@gmail.com>
wrote:

> I'm particularly interested in Noble and Mark's comments...
>
> Let's say you have 5 nodes: n1, n2, n3, n4, n5.
>
> n1 is the leader, n2 watches n1 etc.
>
> Now I call retryElection for n3 with joinAtHead=true. Both n2 and n3
> are watching n1. So far, so good.
>
> My expectation is that deleting n1 would cause n3 to become leader,
> but it isn't at all guaranteed. I have a test case illustrating this.
>
> Incidentally, I think I should get the same result by calling
> retryElection on n1 with joinAtHead=false; n3 should become the
> leader.
>
> I was working on SOLR-6691 and slowly going crazy since everything I
> was trying would fail. Basically, to rebalance leaders (thanks Noble
> for pointing out how far off I was in my original approach) it seemed
> like it would be sufficient to
>
> 1> have the preferred leader retry the election at the head
> 2> tell the old leader to retry at the tail
>
> I expected the old node that was watching the leader to figure out
> that it wasn't really next in line and re-add itself to the end.
>
> But things went all to hell in a handbasket when I wrote a harness
> that exercised it, and it drove me a bit nuts. Especially since it
> would fail one way one time and another way the next. And it'd even
> succeed upon occasion....
>
> I figured out that my expectations weren't being met. Due to the way
> leader queues are sorted, if the two sequence numbers are identical
> then the tie-breaker does NOT pick the last node to join at head.  It
> picks the one with the lowest (highest? didn't track that down
> entirely) session ID. Either way, sometimes it picks the node newly
> added at the head and sometimes it picks the old one.
>
> If I _am_ on the right path, then I propose the following:
> 1> I'll raise a new JIRA for leader sequence sorting and take it on.
> I'm not quite sure how to fix it; the ideas I have are fairly hacky.
>
> 2> I'll back out the REBALANCELEADER stuff. Currently it'll break
> things badly and we're too close to 5.0 to try to do anything about
> <1> IMO. This just means that I'll comment out the collections API
> call in the code and update the ref guide.
>
> 3> When <1> is resolved, I'll put REBALANCELEADERs back in, but that
> won't be before 5.1
>
> Erick
