Beom Heyn Kim created ZOOKEEPER-2832:
----------------------------------------
Summary: Data Inconsistency occurs if follower has uncommitted
transaction in the log while synchronizing with the leader that has the lower
last processed zxid
Key: ZOOKEEPER-2832
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2832
Project: ZooKeeper
Issue Type: Bug
Components: quorum
Affects Versions: 3.4.9
Reporter: Beom Heyn Kim
Fix For: 3.4.10
Synchronization code may fail to truncate an uncommitted transaction in the
follower’s transaction log. Here is a scenario:
Initial condition:
Start the ensemble with three nodes A, B and C with C being the leader
The current epoch is 1
For simplicity of the example, let’s say zxid is a two digit number, with epoch
being the first digit
Create two znodes ‘key0’ and ‘key1’ whose value is ‘0’ and ‘1’, respectively
The zxid is 12 -- 11 for creating key0 and 12 for creating key1. (For
simplicity of the example, the zxid gets increased only by transactions
directly changing the data of znodes.)
All the nodes have seen the change 12 and have persistently logged it
Shut down all
Step 1
Start Node A and B. Epoch becomes 2. Then, a request, setData(key0, 1000), with
zxid 21 is issued. The leader B writes it to the log but Node A is shutdown
before writing it to the log. Then, the leader B is also shut down. The change
21 is applied only to B but not to A or C.
Step 2
Start Node A and C. Epoch becomes 3. Node A has the higher zxid than Node C
(i.e. 20 > 12). So, Node A becomes the leader. Yet, the last processed zxid is
12 for both Node A and C. So, they are in sync already. Node A sends an empty
DIFF to Node C. Node C takes a snapshot and creates snapshot.12. Then, A and C
are shut down. Now, C has the higher zxid than Node B.
Step 3
Start Node B and C. Epoch becomes 4. Node C has the higher zxid than Node B
(i.e. 30 > 21). So, Node C becomes the leader. Node B and C has the different
last processed zxid (i.e. 21 vs 12), and the LinkedList object ‘proposals’ is
empty. Thus, Node C sends SNAP to Node B. Node B takes a clean snapshot and
creates snapshot.12 as the zxid 12 is the last processed zxid of the leader C.
(Note the newly created snapshot on B is assigned the lower zxid then the
change 21 in the log). Then, the request, setData(key1, 1001), with zxid 41 is
issued. Both B and C apply the change 41 into their logs. (Note that now B and
C have the same last processed zxid) Then, B and C are shut down.
Step 4
Start Node B and C. Epoch becomes 5. Node B and C use their local log and
snapshot files to restore their in-memory data tree. Node B has 1000 as the
value of key0, because it’s latest valid snapshot is snapshot.12 and there was
a later transaction with zxid 21 in its log. Yet, Node C has 0 as the value of
key0, because the change 21 was never written on C. Node C is the leader. Node
B and C have the same last processed zxid, i.e. 41. So, they are considered to
be in sync already, and Node C sends an empty DIFF to Node B. So, the
synchronization completes with the initially restored in-memory data tree on B
and C.
Problem
The value of key0 on B is 1000, while the value of the key0 on Node C is 0. The
LearnerHandler.run on C at Step 3, never sends TRUNC but just SNAP. So,
the change 21 was never truncated on B. Also, at step 4, since B uses the
snapshot of the lower zxid to restore its in-memory data tree, the change 21
could get into the data tree. Then, the leader C at the step 4 did not send
SNAP, because the change 41 made to both B and C makes the leader C think the B
and C are already in sync. Thus, data inconsistency occurs.
The attached test case can deterministically reproduce the bug.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)