Wenhan Feng created ZOOKEEPER-4892:
--------------------------------------

             Summary: Inconsistency of the quorum rule of the leader election 
process between the Zookeeper implmentation and specification.
                 Key: ZOOKEEPER-4892
                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4892
             Project: ZooKeeper
          Issue Type: Bug
          Components: quorum
    Affects Versions: 3.9.3
         Environment: Run Zookeeper specification with TLA+ Toolbox 1.8.0 in 
Win11.

Run Zookeeper implementation in ubuntu 220.4
            Reporter: Wenhan Feng
         Attachments: Inconsistency in quorom rule.png

It seems the quorum rule of the leader election process between the Zookeeper 
implmentation and specification is inconsistent.

In {*}implementation{*}, according to the code in the _FastLeaderElection_
class, a node maintains the votes of all nodes in the cluster it considers 
through a `recvset` (which are modeled into _receiveVotes_ variable in the 
specification). At the same time, it maintains its own vote through the 
variables {_}proposedLeader{_}, {_}proposedZxid{_}, and _proposedEpoch_ (which 
are modeled into _currentVote_ variable in the specification). The quorum rule 
is that when there are votes from a majority of the nodes in the _recvset_ that 
match its own vote ({_}proposedLeader{_}, {_}proposedZxid{_}, and 
_proposedEpoch_) , the node considers the quorum rule is satisfied and moves to 
the next stage of the election.

Related codes:
{code:java}
// Line 761-780 in FastLeaderElection.java
protected SyncedLearnerTracker getVoteTracker(Map<Long, Vote> votes, Vote vote) 
{
        ....
        for (Map.Entry<Long, Vote> entry : votes.entrySet()) {
            if (vote.equals(entry.getValue())) {
                voteSet.addAck(entry.getKey());
            }
        }
        return voteSet;
}

// Line 1045 in FastLeaderElection.java
voteSet = getVoteTracker(recvset, new Vote(proposedLeader, proposedZxid, 
logicalclock.get(), proposedEpoch));
{code}
In {*}specification{*}, according to the action _FLEHandleNotmsg_ and 
_HandleNotmsg_, the node itself is initially included in the set of nodes that 
have reached consensus. Then, the specification checks whether the votes from 
other nodes in the _rreceiveVotes_ match the node's _currentVote_. If the 
number of nodes whose votes match the node's _currentVote_ reaches a majority 
with the node itself, the quorum is considered reached.

Related codes:
{code:java}
\* Line 216-222 in FastLeaderElection.tla.
VoteSet(i, msource, rcvset, thisvote, thisround) == \{msource} \union \{s \in 
(Server \ {msource}): VoteEqual(rcvset[s].vote,
                                                                                
                            rcvset[s].round,
                                                                                
                            thisvote,
                                                                                
                            thisround)}

HasQuorums(i, msource, rcvset, thisvote, thisround) == LET Q == VoteSet(i, 
msource, rcvset, thisvote, thisround)
                                                       IN IF Q \in Quorums THEN 
TRUE ELSE FALSE

\* Line 433 in HandleNotmsg action of FastLeaderElection.tla.
/\ LET hasQuorums == HasQuorums(i, i, receiveVotes'[i], currentVote'[i], 
n.mround)


{code}
```

```

We can see that these two mechanisms are not entirely consistent. Consider the 
following three-node scenario:

1. The cluster starts, and at this point, node 1's _recvset_ contains votes 
from nodes 1, 2, and 3, all of which are initial votes (denoted as 
`InitialVote`).
2. Node 3 sends a _Notification-1_ (leader=N3, zxid=1, electionEpoch=1) to 
nodes 1, 2, and 3.
3. Node 1 receives _Notification-1_ from node 3.
4. Node 1 updates its vote to _Notification-1_ and sends _Notification-1_ to 
nodes 1, 2, and 3 again.
5. Node 1 updates its _recvset_. At this point, node 1's `recvset` is \{N1: 
_InitialVote_, N2: _InitialVote_, N3: _Notification-1_}.
6. According to the specification, node 1 now believes that quorom nodes have 
reached consensus. However, according to the implementation, node 1 needs to 
receive the _Notification-1_ sent by itself in step 4 and update its _recvset_ 
(i.e., node 1's vote in the `recvset` must also be set to _Notification-1_) 
before it can consider the quorum reached.

This figure "Inconsistency in quorom rule.png" in attachment shows the system 
state after step 5 more detailedly.

I noticed that in the current version of the TLA+ specification, developers are 
attempting to avoid modeling the process of a node sending a message to itself 
and processing its own message. However, this simplification seems to lead to:
1. Inconsistencies between key algorithms in the specification and the 
implementation. In the example above, the inconsistent quorum rule caused me 
some confusion when I read the code.
2. Omission of some test scenarios. In the example above, There are some 
possible concurrent behaviors in the cluster after step 5 but before node 1 
receives its own `Notification` and updates its `recvset` in step 6. These 
concurrent behaviors cannot be verified with the current version of the 
specification.

I believe this issue can be fixed with the following two possible solutions:

1. *Simpler solution*: In the _HandleNotmsg\(i\)_ action of the specification, 
update _receiveVotes[i]_ after performing `UpdateProposal\(i\)`. Then, modify 
the _HasQuorums_ action to match the implementation's quorum rule.
   
2. *Complete solution*: Modify the _BroadcastNotmsg_ action in the 
specification to include _Notification_ messages sent to itself in the 
_recvqueue_, as is actually in the implementation (currently, the specification 
directly ignores such messages). Also, improve the handling logic for these 
types of messages in _FLEReceiveNotmsg_ and _FLEHandleNotmsg_ actions. Finally, 
modify the _HasQuorums_ action to match the implementation's quorum rule.

If this issue is confirmed, I would be happy to provide a PR.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to