Wenhan Feng created ZOOKEEPER-4892:
--------------------------------------

             Summary: Inconsistency of the quorum rule of the leader election process between the ZooKeeper implementation and specification.
                 Key: ZOOKEEPER-4892
                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4892
             Project: ZooKeeper
          Issue Type: Bug
          Components: quorum
    Affects Versions: 3.9.3
         Environment: Run ZooKeeper specification with TLA+ Toolbox 1.8.0 on Windows 11.
Run ZooKeeper implementation on Ubuntu 22.04.
            Reporter: Wenhan Feng
         Attachments: Inconsistency in quorom rule.png

It seems the quorum rule of the leader election process is inconsistent between the ZooKeeper implementation and the specification.

In the {*}implementation{*}, according to the code in the _FastLeaderElection_ class, a node maintains the votes of all nodes in the cluster in a `recvset` (modeled as the _receiveVotes_ variable in the specification). At the same time, it maintains its own vote through the variables {_}proposedLeader{_}, {_}proposedZxid{_}, and _proposedEpoch_ (modeled as the _currentVote_ variable in the specification). The quorum rule is: when the votes from a majority of the nodes in the _recvset_ match the node's own vote ({_}proposedLeader{_}, {_}proposedZxid{_}, and {_}proposedEpoch{_}), the node considers the quorum rule satisfied and moves to the next stage of the election.

Related code:
{code:java}
// Line 761-780 in FastLeaderElection.java
protected SyncedLearnerTracker getVoteTracker(Map<Long, Vote> votes, Vote vote) {
    // ... (voteSet initialization elided)
    for (Map.Entry<Long, Vote> entry : votes.entrySet()) {
        if (vote.equals(entry.getValue())) {
            voteSet.addAck(entry.getKey());
        }
    }
    return voteSet;
}

// Line 1045 in FastLeaderElection.java
voteSet = getVoteTracker(recvset, new Vote(proposedLeader, proposedZxid, logicalclock.get(), proposedEpoch));
{code}

In the {*}specification{*}, according to the actions _FLEHandleNotmsg_ and _HandleNotmsg_, the node itself is initially included in the set of nodes that have reached consensus. The specification then checks whether the votes from the other nodes in _receiveVotes_ match the node's _currentVote_. If the matching nodes, together with the node itself, form a majority, the quorum is considered reached.

Related code:
{code:java}
\* Line 216-222 in FastLeaderElection.tla
VoteSet(i, msource, rcvset, thisvote, thisround) ==
        {msource} \union {s \in (Server \ {msource}):
                            VoteEqual(rcvset[s].vote, rcvset[s].round, thisvote, thisround)}

HasQuorums(i, msource, rcvset, thisvote, thisround) ==
        LET Q == VoteSet(i, msource, rcvset, thisvote, thisround)
        IN  IF Q \in Quorums THEN TRUE ELSE FALSE

\* Line 433, in the HandleNotmsg action of FastLeaderElection.tla
/\ LET hasQuorums == HasQuorums(i, i, receiveVotes'[i], currentVote'[i], n.mround)
{code}

We can see that these two mechanisms are not entirely consistent. Consider the following three-node scenario:
1. The cluster starts. At this point, node 1's _recvset_ contains votes from nodes 1, 2, and 3, all of which are initial votes (denoted as `InitialVote`).
2. Node 3 sends a _Notification-1_ (leader=N3, zxid=1, electionEpoch=1) to nodes 1, 2, and 3.
3. Node 1 receives _Notification-1_ from node 3.
4. Node 1 updates its own vote to _Notification-1_ and sends _Notification-1_ to nodes 1, 2, and 3.
5. Node 1 updates its _recvset_. At this point, node 1's `recvset` is \{N1: _InitialVote_, N2: _InitialVote_, N3: _Notification-1_}.
6. According to the specification, node 1 now believes that a quorum of nodes has reached consensus. However, according to the implementation, node 1 must first receive the _Notification-1_ it sent to itself in step 4 and update its _recvset_ (i.e., node 1's own entry in the `recvset` must also become _Notification-1_) before it can consider the quorum reached.

The attached figure "Inconsistency in quorom rule.png" shows the system state after step 5 in more detail.
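To make the divergence concrete, below is a minimal, self-contained Java sketch that applies both quorum rules to node 1's state after step 5. This is not ZooKeeper source; the `Vote` record and the two counting methods are simplified stand-ins for the logic in `getVoteTracker` and in `VoteSet`/`HasQuorums`.
{code:java}
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Illustrative only: simplified stand-ins for the two quorum rules.
public class QuorumRuleSketch {

    // Simplified vote: (proposedLeader, proposedZxid, proposedEpoch).
    record Vote(long leader, long zxid, long electionEpoch) {}

    // Implementation-style rule (cf. getVoteTracker): count every entry
    // in recvset, including the node's own entry, that equals the node's
    // current vote.
    static int implementationAcks(Map<Long, Vote> recvset, Vote ownVote) {
        int acks = 0;
        for (Vote v : recvset.values()) {
            if (Objects.equals(v, ownVote)) {
                acks++;
            }
        }
        return acks;
    }

    // Specification-style rule (cf. VoteSet/HasQuorums): the node itself
    // is counted unconditionally, plus every *other* node whose recorded
    // vote equals the node's current vote.
    static int specificationAcks(long self, Map<Long, Vote> recvset, Vote ownVote) {
        int acks = 1; // the node itself, regardless of recvset[self]
        for (Map.Entry<Long, Vote> e : recvset.entrySet()) {
            if (e.getKey() != self && Objects.equals(e.getValue(), ownVote)) {
                acks++;
            }
        }
        return acks;
    }

    public static void main(String[] args) {
        // Placeholder initial vote; its exact contents do not matter here,
        // only that it differs from Notification-1.
        Vote initialVote = new Vote(0, 0, 0);
        Vote notification1 = new Vote(3, 1, 1); // leader=N3, zxid=1, electionEpoch=1

        // Node 1's state after step 5: it has adopted Notification-1 as its
        // own vote, but has not yet processed its own notification, so its
        // recvset entry still holds the initial vote.
        Map<Long, Vote> recvset = new HashMap<>();
        recvset.put(1L, initialVote);
        recvset.put(2L, initialVote);
        recvset.put(3L, notification1);

        int majority = 2; // majority of a 3-node ensemble
        System.out.println("implementation says quorum: "
                + (implementationAcks(recvset, notification1) >= majority));    // false
        System.out.println("specification says quorum:  "
                + (specificationAcks(1L, recvset, notification1) >= majority)); // true
    }
}
{code}
Running this prints that the implementation-style rule finds only one matching ack (node 3's) and denies the quorum, while the specification-style rule counts node 1 unconditionally plus node 3 and grants it, which is exactly the divergence described above.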
I noticed that in the current version of the TLA+ specification, the developers avoid modeling the process of a node sending a message to itself and processing its own message. However, this simplification seems to lead to:
1. Inconsistencies between key algorithms in the specification and the implementation. In the example above, the inconsistent quorum rule caused me some confusion when I read the code.
2. Omission of some test scenarios. In the example above, there are possible concurrent behaviors in the cluster after step 5, before node 1 receives its own `Notification` and updates its `recvset`. These concurrent behaviors cannot be verified with the current version of the specification.

I believe this issue can be fixed with one of the following two solutions:
1. *Simpler solution*: In the _HandleNotmsg(i)_ action of the specification, update _receiveVotes[i]_ after performing `UpdateProposal(i)`. Then modify the _HasQuorums_ operator to match the implementation's quorum rule.
2. *Complete solution*: Modify the _BroadcastNotmsg_ action in the specification so that _Notification_ messages a node sends to itself are placed in its _recvqueue_, as is actually done in the implementation (currently, the specification simply drops such messages). Also, extend the handling logic for these messages in the _FLEReceiveNotmsg_ and _FLEHandleNotmsg_ actions. Finally, modify the _HasQuorums_ operator to match the implementation's quorum rule.

If this issue is confirmed, I would be happy to provide a PR.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)