Hi Flavio,
Here is my attempt:
Let's assume a large cluster and look at three nodes.
ep = electionEpoch, p = peerEpoch, z = zxid
step 1:
A [ ep:3, p:1, z: 1 ] [ LOOKING ] { failed to follow multiple times, hence resets its vote to stored values and bumps up its epoch }
B [ ep:1, p:2, z: 9 ] [ FOLLOWING ]
C [ ep:1, p:2, z: 9 ] [ LEADING ]
step 2:
B goes into LOOKING state but cannot reach A or C.
A [ ep:3, p:1, z: 1 ] [ LOOKING ]
B [ ep:2, p:2, z: 99 ] [ LOOKING ] { starts vote with last committed transaction }
C [ ep:1, p:2, z: 99 ] [ LEADING ]
step 3:
C goes into LOOKING.
A [ ep:3, p:1, z: 1 ] [ LOOKING ]
B [ ep:2, p:2, z: 99 ] [ LOOKING ]
C [ ep:2, p:2, z: 999 ] [ LOOKING ]
step 4:
Only B and C can reach each other, so they converge.
A [ ep:3, p:1, z: 1 ] [ LOOKING ]
B [ ep:2, p:2, z: 999 ] [ LOOKING ]
C [ ep:2, p:2, z: 999 ] [ LOOKING ]
step 5:
B now hears from A, and A still cannot see C.
Per the code at bit.ly/1kxjk5G: A.ep > B.logicalClock and totalOrderPredicate(A, B) is false, so B resets to the values stored on disk, i.e. it loses the values it learned from C, but copies the logicalClock.
A [ ep:3, p:1, z: 1 ] [ LOOKING ]
B [ ep:3, p:2, z: 99 ] [ LOOKING ] { moved back zxid }
C [ ep:2, p:2, z: 999 ] [ LOOKING ]
Now A and B converge.
step 6:
A [ ep:3, p:2, z: 99 ] [ LOOKING ]
B [ ep:3, p:2, z: 99 ] [ LOOKING ]
C [ ep:2, p:2, z: 999 ] [ LOOKING ]
My initial question is: what is the goal of resetting the proposal to the on-disk values when the received proposal's electionEpoch is greater than the current logical clock and totalOrderPredicate(Rx, this) is false, as in step 5 above? This caused B to unlearn and then re-learn.
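For reference, here is a minimal sketch of the comparison I mean, with the step-5 values plugged in. The totalOrderPredicate body mirrors my reading of the real FastLeaderElection method (compare peerEpoch, then zxid, then server id); everything around it, including the server ids I chose (A=1, B=2), is just illustration:

```java
// Sketch of the FLE vote ordering discussed above. Only the
// predicate mirrors the real FastLeaderElection code; the ids
// and the surrounding bookkeeping are made up for illustration.
public class FleSketch {
    // True iff the vote (newId, newZxid, newEpoch) beats the vote
    // (curId, curZxid, curEpoch): peerEpoch first, then zxid, then id.
    static boolean totalOrderPredicate(long newId, long newZxid, long newEpoch,
                                       long curId, long curZxid, long curEpoch) {
        return (newEpoch > curEpoch)
            || (newEpoch == curEpoch
                && (newZxid > curZxid
                    || (newZxid == curZxid && newId > curId)));
    }

    public static void main(String[] args) {
        // Step 5: B receives A's notification (p:1, z:1) while B's
        // on-disk values are (p:2, z:99). A's electionEpoch is higher
        // than B's logical clock, but A's vote loses the predicate, so
        // B bumps its logical clock and resets to its on-disk values,
        // "unlearning" the (p:2, z:999) vote it had taken from C.
        long aId = 1, aZxid = 1, aPeerEpoch = 1;
        long bDiskId = 2, bDiskZxid = 99, bDiskPeerEpoch = 2;
        boolean aWins = totalOrderPredicate(aId, aZxid, aPeerEpoch,
                                            bDiskId, bDiskZxid, bDiskPeerEpoch);
        System.out.println("A's vote wins over B's on-disk vote? " + aWins);
        // prints: A's vote wins over B's on-disk vote? false
    }
}
```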
Also, if you could shed some light on the role of electionEpoch, that would be great. Is it because a Vote received by LeaderElection could be stale, so forcing the system to converge on an electionEpoch helps with liveness, i.e. electing a leader with reasonably recent votes? But then why doesn't it consider Votes from LEADER/FOLLOWER peers for learning? Why is it necessary to learn only from LOOKING peers and not from LEADER/FOLLOWER peers?
Here is another case to illustrate this problem:
A[K], B[K], C[F], D[L], E[F] { K = LOOKING, F = FOLLOWING, L = LEADING }
The system is partitioned so that A and B can see C and D but not E, while C and D can see everyone. In this case A and B will never go into FOLLOWING state and follow D, since neither of them will ever learn from out-of-election peers (the exception is when an out-of-election peer has the same election epoch as the current logical clock).
Here the system is working without the participation of A and B.
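To make the exception I mean concrete, this is my simplified reading of the fast path a LOOKING server applies to a notification from a LEADING/FOLLOWING peer: the vote is only adopted immediately when its electionEpoch equals the receiver's logical clock (the real code also runs termPredicate and checkLeader, which I've elided here):

```java
// Sketch of why A and B above stay LOOKING. Assumption: this models
// only the electionEpoch equality check in the FOLLOWING/LEADING case
// of lookForLeader; quorum and leader checks are deliberately elided.
public class OutOfElectionSketch {
    // An out-of-election vote can be taken through the fast path only
    // when its electionEpoch matches our current logical clock.
    static boolean canAdoptImmediately(long notificationElectionEpoch,
                                       long myLogicalClock) {
        return notificationElectionEpoch == myLogicalClock;
    }

    public static void main(String[] args) {
        // Say A's logical clock has climbed to 3, while D has been
        // LEADING since an election that ran with electionEpoch 1:
        // A never adopts D's vote through this path.
        System.out.println(canAdoptImmediately(1, 3)); // prints: false
    }
}
```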
Any help is appreciated.
thanks,
Powell.
On Tuesday, January 5, 2016 8:55 AM, Flavio Junqueira <[email protected]>
wrote:
Hi Powell,
I don't understand why you want to reset the values of the server vote when the
totalOrderPredicate check fails. The values you're referring to are epoch and
zxid?
In the example you give, it looks like you're saying that B's vote wins over C's vote and C's wins over A's, so the order is B > C > A, but A shouldn't take C's vote because it already took B's, and B's vote wins. If that's the case, then this already happens. I'm probably missing the point here, so perhaps you could provide an example with more detail, like with epoch numbers and such, to illustrate it.
-Flavio
> On 01 Jan 2016, at 04:19, Powell Molleti <[email protected]> wrote:
>
> Hi,
> I want to better understand the use of the code here: http://bit.ly/1kxjk5G
> Why should FLE reset the Vote to the on-disk/initial values when
> totalOrderPredicate() fails in the case of the received ElectionEpoch being
> greater than the current vote's ElectionEpoch?
> Going back to the initial values (and clearing the recv set) does not seem to
> make it incorrect, but it seems to slow down FLE, if I am not mistaken. For
> example, if B has the best vote by totalOrderPredicate() and A learns from it,
> and C has a higher election epoch but older values, then A is forced to reset
> what it learned from B until C and B catch up to each other. Why not instead
> let A and B wait for C to upgrade its values after A and B borrow its
> ElectionEpoch?
> Any help is appreciated.
> thanks,
> Powell.