Beom Heyn Kim created ZOOKEEPER-2832:
----------------------------------------

             Summary: Data Inconsistency occurs if follower has uncommitted 
transaction in the log while synchronizing with the leader that has the lower 
last processed zxid
                 Key: ZOOKEEPER-2832
                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2832
             Project: ZooKeeper
          Issue Type: Bug
          Components: quorum
    Affects Versions: 3.4.9
            Reporter: Beom Heyn Kim
             Fix For: 3.4.10


Synchronization code may fail to truncate an uncommitted transaction in the 
follower’s transaction log. Here is a scenario:
 
Initial condition:
Start the ensemble with three nodes A, B and C with C being the leader
The current epoch is 1
For simplicity of the example, let’s say zxid is a two digit number, with epoch 
being the first digit
Create two znodes ‘key0’ and ‘key1’ whose value is ‘0’ and ‘1’, respectively
The zxid is 12 -- 11 for creating key0 and 12 for creating key1. (For 
simplicity of the example, the zxid gets increased only by transactions 
directly changing the data of znodes.)
All the nodes have seen the change 12 and have persistently logged it
Shut down all nodes
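
For readers comparing the two-digit notation above with real zxids: ZooKeeper 
packs the epoch into the high 32 bits of a 64-bit zxid and a per-epoch counter 
into the low 32 bits. The sketch below is only an illustration of that mapping; 
the helper names (make_zxid, two_digit) are made up for this example and are 
not ZooKeeper APIs.

```python
# Illustrative only: helper names are hypothetical, not ZooKeeper classes.

def make_zxid(epoch: int, counter: int) -> int:
    """Pack epoch (high 32 bits) and counter (low 32 bits) into one zxid."""
    return (epoch << 32) | (counter & 0xFFFFFFFF)

def epoch_of(zxid: int) -> int:
    return zxid >> 32

def counter_of(zxid: int) -> int:
    return zxid & 0xFFFFFFFF

def two_digit(zxid: int) -> int:
    """Render a zxid in the simplified notation used in this report."""
    return epoch_of(zxid) * 10 + counter_of(zxid)

key0_created = make_zxid(1, 1)   # "11" in the report
key1_created = make_zxid(1, 2)   # "12" in the report
new_epoch = make_zxid(2, 0)      # "20": a new epoch with no transactions yet
```

Because the epoch occupies the high bits, any zxid from a newer epoch compares 
greater than every zxid from an older one, which is what the later steps rely 
on.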
 
Step 1
Start Node A and B. Epoch becomes 2. Then, a request, setData(key0, 1000), with 
zxid 21 is issued. The leader B writes it to the log, but Node A is shut down 
before writing it to the log. Then, the leader B is also shut down. The change 
21 is logged only on B, not on A or C.
 
Step 2
Start Node A and C. Epoch becomes 3. Node A has a higher zxid than Node C 
(i.e. 20 > 12), so Node A becomes the leader. Yet the last processed zxid is 
12 on both Node A and C, so they are already in sync. Node A sends an empty 
DIFF to Node C. Node C takes a snapshot and creates snapshot.12. Then, A and C 
are shut down. Now, C has a higher zxid than Node B.
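
The leader choices in Steps 2 to 4 can be modelled as picking the node with the 
largest (epoch, counter, server id) tuple. This is a simplified sketch, not the 
FastLeaderElection code; the Vote type is hypothetical, and the server ids 
(A=1, B=2, C=3) are an assumption, since the report does not state them.

```python
from typing import NamedTuple, List

class Vote(NamedTuple):
    epoch: int       # first digit of the report's two-digit zxid
    counter: int     # second digit of the report's two-digit zxid
    server_id: int   # assumed ids: A=1, B=2, C=3 (not given in the report)
    name: str

def elect(votes: List[Vote]) -> Vote:
    """Pick the vote with the highest epoch, then counter, then server id."""
    return max(votes, key=lambda v: (v.epoch, v.counter, v.server_id))

# Step 2: A is at "20", C is at "12".
step2 = elect([Vote(2, 0, 1, "A"), Vote(1, 2, 3, "C")])
# Step 3: B is at "21", C is at "30".
step3 = elect([Vote(2, 1, 2, "B"), Vote(3, 0, 3, "C")])
# Step 4: B and C are both at "41"; the tie falls back to the server id.
step4 = elect([Vote(4, 1, 2, "B"), Vote(4, 1, 3, "C")])
```

Under these assumptions the model yields A, C and C as leaders for Steps 2, 3 
and 4, matching the scenario.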
 
Step 3
Start Node B and C. Epoch becomes 4. Node C has a higher zxid than Node B 
(i.e. 30 > 21), so Node C becomes the leader. Node B and C have different 
last processed zxids (i.e. 21 vs 12), and the LinkedList object ‘proposals’ is 
empty. Thus, Node C sends SNAP to Node B. Node B takes a clean snapshot and 
creates snapshot.12, as zxid 12 is the last processed zxid of the leader C. 
(Note that the newly created snapshot on B is assigned a lower zxid than the 
change 21 in the log.) Then, the request, setData(key1, 1001), with zxid 41 is 
issued. Both B and C write the change 41 to their logs. (Note that B and C now 
have the same last processed zxid.) Then, B and C are shut down.
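
The leader's choice between DIFF, TRUNC and SNAP in Steps 2 to 4 can be 
sketched as below. This is a simplified model written from the behaviour 
described in this report, not a copy of LearnerHandler.run; the function name 
choose_sync and its parameters are hypothetical, and zxids use the two-digit 
notation of the example.

```python
from typing import List

def choose_sync(peer_last_zxid: int,
                leader_last_zxid: int,
                proposals: List[int]) -> str:
    """Return the sync packet type the leader would send to a follower."""
    if peer_last_zxid == leader_last_zxid:
        return "DIFF"            # considered in sync: empty DIFF
    if proposals:                # leader still holds recent committed proposals
        lo, hi = proposals[0], proposals[-1]
        if lo <= peer_last_zxid <= hi:
            return "DIFF"        # follower is behind: send the missing part
        if peer_last_zxid > hi:
            return "TRUNC"       # follower is ahead: ask it to truncate
    return "SNAP"                # fallback: full state transfer

step2 = choose_sync(12, 12, [])  # A (leader) syncing C
step3 = choose_sync(21, 12, [])  # C (leader) syncing B, proposals empty
step4 = choose_sync(41, 41, [])  # C (leader) syncing B after restart
```

In this model Step 3 falls through to SNAP only because ‘proposals’ is empty, 
so the TRUNC branch that would have removed the change 21 from B's log is never 
reached.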
 
Step 4
Start Node B and C. Epoch becomes 5. Node B and C use their local log and 
snapshot files to restore their in-memory data trees. Node B has 1000 as the 
value of key0, because its latest valid snapshot is snapshot.12 and there is 
a later transaction with zxid 21 in its log. Yet Node C has 0 as the value of 
key0, because the change 21 was never written on C. Node C is the leader. Node 
B and C have the same last processed zxid, i.e. 41, so they are considered to 
be in sync already, and Node C sends an empty DIFF to Node B. So, the 
synchronization completes with the initially restored in-memory data trees on B 
and C.
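
The restore in Step 4 can be illustrated with a small model: load the latest 
snapshot, then replay every logged transaction whose zxid is greater than the 
snapshot's zxid. The restore function and the tuple layout of the log entries 
are assumptions made for this sketch; they are not ZooKeeper's on-disk format 
or API.

```python
from typing import Dict, List, Tuple

Txn = Tuple[int, str, str]  # (zxid, key, value), two-digit zxid notation

def restore(snapshot_zxid: int,
            snapshot_data: Dict[str, str],
            txn_log: List[Txn]) -> Tuple[Dict[str, str], int]:
    """Rebuild a data tree from a snapshot plus the newer log entries."""
    tree = dict(snapshot_data)
    last_zxid = snapshot_zxid
    for zxid, key, value in sorted(txn_log):
        if zxid > snapshot_zxid:     # only replay what the snapshot lacks
            tree[key] = value
            last_zxid = zxid
    return tree, last_zxid

snapshot_12 = {"key0": "0", "key1": "1"}

# B still has the never-committed change 21 in its log; C never saw it.
log_b = [(11, "key0", "0"), (12, "key1", "1"),
         (21, "key0", "1000"), (41, "key1", "1001")]
log_c = [(11, "key0", "0"), (12, "key1", "1"),
         (41, "key1", "1001")]

tree_b, last_b = restore(12, snapshot_12, log_b)
tree_c, last_c = restore(12, snapshot_12, log_c)
```

Both nodes end with last processed zxid 41, yet tree_b holds key0 = 1000 while 
tree_c holds key0 = 0, which is the divergence that the empty DIFF leaves in 
place.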
 
Problem
The value of key0 on B is 1000, while the value of key0 on Node C is 0. 
LearnerHandler.run on C at Step 3 never sends TRUNC, only SNAP, so the change 
21 was never truncated from B's log. Also, at Step 4, since B uses a snapshot 
with a lower zxid to restore its in-memory data tree, the change 21 is 
replayed into the data tree. Then, the leader C at Step 4 does not send SNAP, 
because the change 41 made to both B and C makes the leader C think that B 
and C are already in sync. Thus, data inconsistency occurs.
 
The attached test case can deterministically reproduce the bug.




