Wenhan Feng created ZOOKEEPER-4892:
--------------------------------------
Summary: Inconsistency of the quorum rule of the leader election
process between the ZooKeeper implementation and specification.
Key: ZOOKEEPER-4892
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4892
Project: ZooKeeper
Issue Type: Bug
Components: quorum
Affects Versions: 3.9.3
Environment: Run the ZooKeeper specification with TLA+ Toolbox 1.8.0 on
Windows 11.
Run the ZooKeeper implementation on Ubuntu 22.04.
Reporter: Wenhan Feng
Attachments: Inconsistency in quorom rule.png
It seems the quorum rule used in the leader election process is inconsistent
between the ZooKeeper implementation and the specification.
In the {*}implementation{*}, according to the code in the _FastLeaderElection_
class, a node tracks the votes it has received from all nodes in the cluster
in a _recvset_ (modeled as the _receiveVotes_ variable in the specification).
At the same time, it maintains its own vote in the fields
{_}proposedLeader{_}, {_}proposedZxid{_}, and _proposedEpoch_ (modeled as the
_currentVote_ variable in the specification). The quorum rule is: when the
votes from a majority of the nodes in the _recvset_ match the node's own vote
({_}proposedLeader{_}, {_}proposedZxid{_}, and {_}proposedEpoch{_}), the node
considers the quorum rule satisfied and moves to the next stage of the
election.
Related code:
{code:java}
// Line 761-780 in FastLeaderElection.java
protected SyncedLearnerTracker getVoteTracker(Map<Long, Vote> votes, Vote vote) {
    ....
    for (Map.Entry<Long, Vote> entry : votes.entrySet()) {
        if (vote.equals(entry.getValue())) {
            voteSet.addAck(entry.getKey());
        }
    }
    return voteSet;
}

// Line 1045 in FastLeaderElection.java
voteSet = getVoteTracker(recvset, new Vote(proposedLeader, proposedZxid,
        logicalclock.get(), proposedEpoch));
{code}
In the {*}specification{*}, according to the actions _FLEHandleNotmsg_ and
_HandleNotmsg_, the node itself is unconditionally included in the set of
nodes that agree with its vote. The specification then checks whether the
votes from the other nodes in _receiveVotes_ match the node's _currentVote_.
If the nodes whose votes match _currentVote_, together with the node itself,
form a majority, the quorum is considered reached.
Related code:
{code:java}
\* Line 216-222 in FastLeaderElection.tla
VoteSet(i, msource, rcvset, thisvote, thisround) ==
    {msource} \union {s \in (Server \ {msource}):
                          VoteEqual(rcvset[s].vote, rcvset[s].round,
                                    thisvote, thisround)}

HasQuorums(i, msource, rcvset, thisvote, thisround) ==
    LET Q == VoteSet(i, msource, rcvset, thisvote, thisround)
    IN  IF Q \in Quorums THEN TRUE ELSE FALSE

\* Line 433 in HandleNotmsg action of FastLeaderElection.tla
/\ LET hasQuorums == HasQuorums(i, i, receiveVotes'[i], currentVote'[i], n.mround)
{code}
We can see that these two mechanisms are not entirely consistent. Consider the
following three-node scenario:
1. The cluster starts, and at this point, node 1's _recvset_ contains votes
from nodes 1, 2, and 3, all of which are initial votes (denoted as
_InitialVote_).
2. Node 3 sends _Notification-1_ (leader=N3, zxid=1, electionEpoch=1) to
nodes 1, 2, and 3.
3. Node 1 receives _Notification-1_ from node 3.
4. Node 1 updates its own vote to _Notification-1_ and, in turn, sends
_Notification-1_ to nodes 1, 2, and 3.
5. Node 1 updates its _recvset_. At this point, node 1's _recvset_ is \{N1:
_InitialVote_, N2: _InitialVote_, N3: _Notification-1_}.
6. According to the specification, node 1 now believes that a quorum of nodes
has reached consensus. According to the implementation, however, node 1 must
first receive the _Notification-1_ it sent to itself in step 4 and update its
_recvset_ (i.e., node 1's own entry in the _recvset_ must also become
_Notification-1_) before it considers the quorum reached.
The attached figure "Inconsistency in quorom rule.png" shows the system state
after step 5 in more detail.
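To make the divergence concrete, below is a minimal, self-contained TLA+
sketch (illustrative only: the module name _QuorumRuleGap_, the simplified
vote records, and both vote-set definitions are stand-ins, not code from the
actual specification or implementation) that evaluates the two quorum rules
against node 1's state after step 5:
{code:java}
---- MODULE QuorumRuleGap ----
\* Illustrative sketch only: contrasts the two quorum rules on node 1's
\* state after step 5. The vote records are simplified stand-ins.
EXTENDS Naturals, FiniteSets

Server == {1, 2, 3}
Notif1 == [leader |-> 3, zxid |-> 1]       \* stand-in for Notification-1
InitVote(s) == [leader |-> s, zxid |-> 0]  \* stand-in for InitialVote of s

\* Node 1's recvset after step 5: its own entry still holds its initial vote.
RecvSet == [s \in Server |-> IF s = 3 THEN Notif1 ELSE InitVote(s)]

IsQuorum(S) == 2 * Cardinality(S) > Cardinality(Server)

\* Specification rule: node 1 counts itself unconditionally.
SpecVoteSet == {1} \union {s \in Server \ {1} : RecvSet[s] = Notif1}

\* Implementation rule (getVoteTracker): only recvset entries that equal the
\* node's current vote are counted, including node 1's own stale entry.
ImplVoteSet == {s \in Server : RecvSet[s] = Notif1}

ASSUME IsQuorum(SpecVoteSet)   \* spec: {1, 3} already forms a quorum
ASSUME ~IsQuorum(ImplVoteSet)  \* impl: {3} alone does not
====
{code}
Parsing this module with SANY (or running TLC on it) checks the two ASSUME
lines: at step 5 the specification's rule already sees a quorum, while the
implementation's rule does not.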
I noticed that in the current version of the TLA+ specification, the
developers avoid modeling a node sending a _Notification_ to itself and then
processing that message. However, this simplification seems to lead to:
1. Inconsistencies between key algorithms in the specification and the
implementation. In the example above, the inconsistent quorum rule caused me
some confusion when I read the code.
2. Omission of some test scenarios. In the example above, there are possible
concurrent behaviors in the cluster after step 5 but before node 1 receives
its own _Notification_ and updates its _recvset_ in step 6. These
interleavings cannot be explored with the current version of the
specification.
I believe this issue can be fixed by either of the following two solutions:
1. *Simpler solution*: In the _HandleNotmsg\(i\)_ action of the specification,
update _receiveVotes[i]_ after performing _UpdateProposal\(i\)_. Then, modify
the _HasQuorums_ operator to match the implementation's quorum rule (see the
sketch after this list).
2. *Complete solution*: Modify the _BroadcastNotmsg_ action in the
specification so that _Notification_ messages a node sends to itself are
placed in its _recvqueue_, as actually happens in the implementation
(currently, the specification simply drops such messages). Also, extend the
handling logic for these messages in the _FLEReceiveNotmsg_ and
_FLEHandleNotmsg_ actions. Finally, modify the _HasQuorums_ operator to match
the implementation's quorum rule.
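For the simpler solution, here is a rough, untested sketch of the intended
change. It assumes, based on the excerpt above, that entries of
_receiveVotes_ are records with _vote_ and _round_ fields; the exact field
and variable names are assumptions, not necessarily the spec's actual ones:
{code:java}
\* Untested sketch of the simpler solution; field names are assumed from the
\* VoteSet excerpt above rather than copied from the actual specification.

\* (a) In HandleNotmsg(i), after UpdateProposal(i), also refresh node i's own
\*     entry in its recvset so it reflects the just-updated proposal
\*     (in the real action this would be folded into the existing
\*     receiveVotes' update):
\*
\*     /\ receiveVotes' = [receiveVotes EXCEPT ![i].vote  = currentVote'[i],
\*                                             ![i].round = n.mround]

\* (b) Drop the unconditional {msource} from VoteSet, so that quorum counting
\*     matches getVoteTracker(): every counted node, including i itself, must
\*     have a matching entry in the recvset.
VoteSet(i, msource, rcvset, thisvote, thisround) ==
    {s \in Server : VoteEqual(rcvset[s].vote, rcvset[s].round,
                              thisvote, thisround)}
{code}
Note that this only makes the quorum rule itself consistent with
_getVoteTracker()_; capturing the delay between step 5 and node 1 processing
its own _Notification_ would still require the complete solution.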
If this issue is confirmed, I would be happy to provide a PR.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)