Hi André,

André Oriani <[email protected]> writes:
> Hi,
>
> Where can I find a good description of the fast leader election used
> by ZooKeeper? I would like to understand better how logical clock is
> used and the two termination conditions (a quorum has voted for a
> certain peer or have received votes from all peers), if I have
> identified correctly.

I don't think there is such a description anywhere. Here is what I
understand.

The logical clocks are used to identify different instances of the
election. Every time the leader crashes, the peers time out and the
logical clock is incremented. The same happens if the election fails to
finish: a new "election instance" or "epoch" is started.

In FLE each peer proposes its own (zxid,id). The algorithm then
converges by always broadcasting the highest pair. The hope is that the
peers will have converged before the algorithm times out. If a message
is received from a future epoch, the peer starts participating in that
epoch.

Because Zab only works when there is a quorum of nodes up and connected,
there is an additional condition to return from lookForLeader: a peer
has to have received messages from a quorum of peers. This avoids
failing the election before enough nodes are up. In fact, the timeout is
only started after a quorum has answered (is up). This timeliness
requirement is, however, strictly stronger than the one required by Zab
itself. Note that if all peers replied in the same epoch, there is no
need to wait for a timeout.

There is a second termination case (in the default branch of the
switch). That is the case when a peer starts an election when there is
already a leader. The peer gets "there-is-already-a-leader-for-epoch-X"
messages back. If a quorum of peers answered with the same leader and
epoch, lookForLeader returns.

Important in both termination cases (electing a new leader, and getting
the actual leader) is that the check for a quorum of votes guarantees
that the candidate (or leader) is in this quorum.
This avoids the case where the ensemble repeatedly elects a crashed
peer.

> The first thing QuorumCnxManager does when connecting to a peer is to
> send its id. Is this done to avoid any problem with concurrent
> connections?

Yes. Peer A opens a TCP connection to peer B, and vice versa. Both send
their ids; the highest id wins (the other connection is closed).

Ciao,
Diogo

> Thanks and Regards,
> André Oriani
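
P.S. To make the rules above a bit more concrete, here is a minimal
sketch in Python. It is not ZooKeeper's code: the names Vote, better,
elected and keep_connection are made up for illustration, and ordering
votes by epoch first is my simplification of "join the future epoch".
It only shows the (zxid,id) ordering, the quorum check that requires
the candidate to be part of the quorum, and the connection tie-break.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Vote:
    """Hypothetical vote record; field names are illustrative only."""
    epoch: int   # logical clock identifying the election instance
    zxid: int    # last transaction id of the proposed leader
    leader: int  # server id of the proposed leader


def better(new: Vote, cur: Vote) -> bool:
    """True if `new` should replace `cur`.

    Assumption: a later epoch supersedes an older one; within an epoch
    the highest (zxid, id) pair wins.
    """
    return (new.epoch, new.zxid, new.leader) > (cur.epoch, cur.zxid, cur.leader)


def elected(votes: Dict[int, Vote], ensemble_size: int) -> Optional[int]:
    """Return the leader id once the election can terminate, else None.

    `votes` maps the sender's server id to the latest vote it sent.
    Termination needs a strict majority backing the same vote, and the
    proposed leader itself must be among those voters.
    """
    if not votes:
        return None
    vote, count = Counter(votes.values()).most_common(1)[0]
    has_quorum = count > ensemble_size // 2
    leader_in_quorum = votes.get(vote.leader) == vote
    return vote.leader if has_quorum and leader_in_quorum else None


def keep_connection(initiator_id: int, acceptor_id: int) -> bool:
    """Tie-break for concurrent connections: the higher id wins."""
    return initiator_id > acceptor_id
```

For example, in a 3-peer ensemble where peers 1 and 2 both vote for
Vote(epoch=1, zxid=10, leader=2), elected() returns 2. If the same vote
comes only from peers 1 and 3, it returns None, because peer 2 has not
been heard from and might have crashed.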
