[
https://issues.apache.org/jira/browse/ZOOKEEPER-4892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18058539#comment-18058539
]
Arup Chauhan commented on ZOOKEEPER-4892:
-----------------------------------------
Hi, I’d like to work on this, I’ll open a draft PR and link it here.
I plan to start with the simpler, spec-only approach: update receiveVotes[i][i]
after UpdateProposal(i) and adjust HasQuorums/VoteSet so quorum counting
matches FastLeaderElection.java (count matching votes from receiveVotes, no
implicit self inclusion). I’ll validate with TLC on a 3-node model and include
the config/results in the PR.
> Inconsistency of the quorum rule of the leader election process between the
> Zookeeper implmentation and specification.
> ----------------------------------------------------------------------------------------------------------------------
>
> Key: ZOOKEEPER-4892
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4892
> Project: ZooKeeper
> Issue Type: Bug
> Components: quorum
> Affects Versions: 3.9.3
> Environment: Run Zookeeper specification with TLA+ Toolbox 1.8.0 in
> Win11.
> Run Zookeeper implementation in ubuntu 220.4
> Reporter: Wenhan Feng
> Priority: Major
> Attachments: Inconsistency in quorom rule.png
>
>
> It seems the quorum rule of the leader election process between the Zookeeper
> implmentation and specification is inconsistent.
> In {*}implementation{*}, according to the code in the _FastLeaderElection_
> class, a node maintains the votes of all nodes in the cluster it considers
> through a `recvset` (which are modeled into _receiveVotes_ variable in the
> specification). At the same time, it maintains its own vote through the
> variables {_}proposedLeader{_}, {_}proposedZxid{_}, and _proposedEpoch_
> (which are modeled into _currentVote_ variable in the specification). The
> quorum rule is that when there are votes from a majority of the nodes in the
> _recvset_ that match its own vote ({_}proposedLeader{_}, {_}proposedZxid{_},
> and _proposedEpoch_) , the node considers the quorum rule is satisfied and
> moves to the next stage of the election.
> Related codes:
> {code:java}
> // Line 761-780 in FastLeaderElection.java
> protected SyncedLearnerTracker getVoteTracker(Map<Long, Vote> votes, Vote
> vote) {
> ....
> for (Map.Entry<Long, Vote> entry : votes.entrySet()) {
> if (vote.equals(entry.getValue())) {
> voteSet.addAck(entry.getKey());
> }
> }
> return voteSet;
> }
> // Line 1045 in FastLeaderElection.java
> voteSet = getVoteTracker(recvset, new Vote(proposedLeader, proposedZxid,
> logicalclock.get(), proposedEpoch));
> {code}
> In {*}specification{*}, according to the action _FLEHandleNotmsg_ and
> _HandleNotmsg_, the node itself is initially included in the set of nodes
> that have reached consensus. Then, the specification checks whether the votes
> from other nodes in the _rreceiveVotes_ match the node's _currentVote_. If
> the number of nodes whose votes match the node's _currentVote_ reaches a
> majority with the node itself, the quorum is considered reached.
> Related codes:
> {code:java}
> \* Line 216-222 in FastLeaderElection.tla.
> VoteSet(i, msource, rcvset, thisvote, thisround) == \{msource} \union \{s \in
> (Server \ {msource}): VoteEqual(rcvset[s].vote,
>
> rcvset[s].round,
>
> thisvote,
>
> thisround)}
> HasQuorums(i, msource, rcvset, thisvote, thisround) == LET Q == VoteSet(i,
> msource, rcvset, thisvote, thisround)
> IN IF Q \in Quorums
> THEN TRUE ELSE FALSE
> \* Line 433 in HandleNotmsg action of FastLeaderElection.tla.
> /\ LET hasQuorums == HasQuorums(i, i, receiveVotes'[i], currentVote'[i],
> n.mround)
> {code}
> We can see that these two mechanisms are not entirely consistent. Consider
> the following three-node scenario:
> 1. The cluster starts, and at this point, node 1's _recvset_ contains votes
> from nodes 1, 2, and 3, all of which are initial votes (denoted as
> `InitialVote`).
> 2. Node 3 sends a _Notification-1_ (leader=N3, zxid=1, electionEpoch=1) to
> nodes 1, 2, and 3.
> 3. Node 1 receives _Notification-1_ from node 3.
> 4. Node 1 updates its vote to _Notification-1_ and sends _Notification-1_ to
> nodes 1, 2, and 3 again.
> 5. Node 1 updates its _recvset_. At this point, node 1's `recvset` is \{N1:
> _InitialVote_, N2: _InitialVote_, N3: _Notification-1_}.
> 6. According to the specification, node 1 now believes that quorom nodes have
> reached consensus. However, according to the implementation, node 1 needs to
> receive the _Notification-1_ sent by itself in step 4 and update its
> _recvset_ (i.e., node 1's vote in the `recvset` must also be set to
> _Notification-1_) before it can consider the quorum reached.
> This figure "Inconsistency in quorom rule.png" in attachment shows the system
> state after step 5 more detailedly.
> I noticed that in the current version of the TLA+ specification, developers
> are attempting to avoid modeling the process of a node sending a message to
> itself and processing its own message. However, this simplification seems to
> lead to:
> 1. Inconsistencies between key algorithms in the specification and the
> implementation. In the example above, the inconsistent quorum rule caused me
> some confusion when I read the code.
> 2. Omission of some test scenarios. In the example above, There are some
> possible concurrent behaviors in the cluster after step 5 but before node 1
> receives its own `Notification` and updates its `recvset` in step 6. These
> concurrent behaviors cannot be verified with the current version of the
> specification.
> I believe this issue can be fixed with the following two possible solutions:
> 1. *Simpler solution*: In the _HandleNotmsg\(i\)_ action of the
> specification, update _receiveVotes[i]_ after performing
> `UpdateProposal\(i\)`. Then, modify the _HasQuorums_ action to match the
> implementation's quorum rule.
>
> 2. *Complete solution*: Modify the _BroadcastNotmsg_ action in the
> specification to include _Notification_ messages sent to itself in the
> _recvqueue_, as is actually in the implementation (currently, the
> specification directly ignores such messages). Also, improve the handling
> logic for these types of messages in _FLEReceiveNotmsg_ and _FLEHandleNotmsg_
> actions. Finally, modify the _HasQuorums_ action to match the
> implementation's quorum rule.
> If this issue is confirmed, I would be happy to provide a PR.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)