[
https://issues.apache.org/jira/browse/ZOOKEEPER-4643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sirius updated ZOOKEEPER-4643:
------------------------------
Summary: Committed txns may be improperly truncated if node crashes right
after updating currentEpoch (was: Committed txns may be improperly truncated
when node crashes right after updating currentEpoch )
> Committed txns may be improperly truncated if node crashes right after
> updating currentEpoch
> ---------------------------------------------------------------------------------------------
>
> Key: ZOOKEEPER-4643
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4643
> Project: ZooKeeper
> Issue Type: Bug
> Components: quorum, server
> Affects Versions: 3.8.0, 3.7.1
> Reporter: Sirius
> Priority: Critical
>
> When a follower processes the NEWLEADER message in the SYNC phase, it writes
> the new currentEpoch to the currentEpoch file *before* writing the
> uncommitted txns to the log file. This order may lead to improper truncation
> of *committed* txns in later rounds.
>
> The critical step is to make a follower crash right after it writes the new
> currentEpoch to the file but before it writes the uncommitted txns to the
> log file.
>
> h2. Trace
> Here is an example that triggers the bug. (Focus on currentEpoch and
> lastLoggedZxid.)
> *Initial condition:*
> * Start the ensemble with three nodes: A, B and C.
> * Node C is elected leader.
> * For all of them, acceptedEpoch=1, currentEpoch=1.
> * All of them also have lastLoggedZxid = <1, 3> and lastProcessedZxid = <1,
> 3>.
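
The <epoch, counter> pairs in this trace can be read as 64-bit zxids, with the epoch in the high 32 bits and the counter in the low 32 bits. The following is a minimal sketch of that layout; the class and method names are made up for illustration and are not ZooKeeper's own helpers.

```java
public class ZxidSketch {
    // Pack <epoch, counter> into one 64-bit zxid:
    // epoch in the high 32 bits, counter in the low 32 bits.
    static long makeZxid(long epoch, long counter) {
        return (epoch << 32) | (counter & 0xffffffffL);
    }

    static long epochOf(long zxid) {
        return zxid >> 32;
    }

    static long counterOf(long zxid) {
        return zxid & 0xffffffffL;
    }

    public static void main(String[] args) {
        long z13 = makeZxid(1, 3); // <1, 3>
        long z14 = makeZxid(1, 4); // <1, 4>
        System.out.println(Long.toHexString(z13)); // 100000003
        System.out.println(z14 > z13);             // true
    }
}
```

Because the epoch occupies the high bits, comparing two zxids as plain longs orders them first by epoch and then by counter, which is why <1, 4> counts as newer than <1, 3> in the steps below.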
> *Step 1:*
> * Node A crashes.
> *Step 2:*
> * A new txn is logged and committed by Nodes B and C. B and C then have
> lastLoggedZxid = <1, 4> and lastProcessedZxid = <1, 4>. (Clients can read
> the data tree with latest zxid <1, 4>.)
> *Step 3:*
> * Node A restarts, Node C restarts and Node B crashes.
> * Again, C is elected leader.
> * During the DISCOVERY phase, both A and C update their acceptedEpoch to 2.
> * Then, during the SYNC phase, the leader C (maxCommittedLog = <1, 4>) uses
> DIFF to sync with the follower A (lastLoggedZxid = <1, 3>), and their
> currentEpoch will be set to 2 (and written to disk).
> * Note that the follower A updates its currentEpoch file before writing the
> uncommitted txns to the log file when receiving NEWLEADER message.
> * *Unfortunately, right after the follower A finishes updating its
> currentEpoch file, it crashes.*
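
The choice between DIFF and TRUNC in this trace depends on how the follower's lastLoggedZxid compares with the leader's committed-log range. The following is a deliberately simplified sketch of that decision; the real leader-side logic has more cases, and the names `chooseMode` and `SyncModeSketch`, as well as the minCommittedLog value of <1, 1>, are assumptions made for illustration.

```java
public class SyncModeSketch {
    enum Mode { DIFF, TRUNC, SNAP }

    // <epoch, counter> packed as a 64-bit zxid (epoch in the high 32 bits).
    static long zxid(long epoch, long counter) {
        return (epoch << 32) | counter;
    }

    // Simplified choice of sync mode; not the actual ZooKeeper code.
    static Mode chooseMode(long peerLastZxid, long minCommittedLog, long maxCommittedLog) {
        if (peerLastZxid > maxCommittedLog) {
            return Mode.TRUNC; // follower is ahead: cut its log back to maxCommittedLog
        }
        if (peerLastZxid >= minCommittedLog) {
            return Mode.DIFF;  // follower is behind but in range: send the missing txns
        }
        return Mode.SNAP;      // follower is too far behind: send a full snapshot
    }

    public static void main(String[] args) {
        long min = zxid(1, 1); // assumed; the report does not state minCommittedLog
        // Step 3: leader C has maxCommittedLog <1, 4>, follower A is at <1, 3>.
        System.out.println(chooseMode(zxid(1, 3), min, zxid(1, 4))); // DIFF
        // A follower that is ahead of the leader's committed log gets truncated.
        System.out.println(chooseMode(zxid(1, 4), min, zxid(1, 3))); // TRUNC
    }
}
```

The second call is the dangerous direction: a leader whose log is shorter than a follower's tells the follower to discard the extra txns, whether or not they were committed earlier.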
> *Step 4:*
> * Nodes A and B restart and Node C crashes.
> * Since Node A has currentEpoch=2 and Node B has currentEpoch=1, Node A is
> elected leader.
> * During the SYNC phase, the leader A (maxCommittedLog = <1, 3>) will use
> TRUNC to sync with B (lastLoggedZxid = <1, 4>). Then, B removes txn <1, 4>.
>
> However, <1, 4> was committed and visible to clients before, and is not
> supposed to be truncated!
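
Why A wins the election in Step 4 even though its log is shorter can be sketched with a vote comparison that looks at the epoch first, then the zxid, then the server id. This is a minimal model of that ordering, not the actual election code; the `Vote` class and the server ids 1 (for A) and 2 (for B) are hypothetical.

```java
public class ElectionOrderSketch {
    // A candidate's vote: server id, last logged zxid, and currentEpoch.
    static final class Vote {
        final long sid;
        final long zxid;
        final long epoch;

        Vote(long sid, long zxid, long epoch) {
            this.sid = sid;
            this.zxid = zxid;
            this.epoch = epoch;
        }
    }

    static long zxid(long epoch, long counter) {
        return (epoch << 32) | counter;
    }

    // True if vote n should be preferred over vote c:
    // higher epoch wins; on a tie, higher zxid; on a further tie, higher sid.
    static boolean beats(Vote n, Vote c) {
        return n.epoch > c.epoch
            || (n.epoch == c.epoch
                && (n.zxid > c.zxid || (n.zxid == c.zxid && n.sid > c.sid)));
    }

    public static void main(String[] args) {
        Vote a = new Vote(1, zxid(1, 3), 2); // A: currentEpoch=2, lastLoggedZxid=<1, 3>
        Vote b = new Vote(2, zxid(1, 4), 1); // B: currentEpoch=1, lastLoggedZxid=<1, 4>
        System.out.println(beats(a, b)); // true
        System.out.println(beats(b, a)); // false
    }
}
```

Under this ordering the epoch dominates, so A's stale log is never compared against B's longer one. Had A crashed before writing currentEpoch, both votes would carry epoch 1 and B's <1, 4> would win.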
>
> (Note: With careful timing of crashes and restarts, the trace can be
> constructed with a quorum of nodes alive at any moment.)
>
> The above trace has been triggered by our testing tools. I will provide more
> materials to further demonstrate the above case soon. Besides, I think more
> versions may be affected.
>
> h2. Analysis
> The critical step here is to interrupt two updates that should be done
> together. In the Zab paper, a follower updates its current epoch and history
> during the SYNC phase in a single *atomic* action, and the correctness of Zab
> depends on that. In a real deployment, however, a node crash (or another
> environment failure) may occur at any time and split a step that the protocol
> treats as uninterruptible. This is the gap between theory and reality. As the
> system has evolved, the implementation of the SYNC phase has become much more
> complicated than the original Zab protocol. We think it is important that
> ZooKeeper remains correct even under this kind of environment failure.
>
> h2. Possible Fix
> Intuitively, this issue can be avoided by swapping the order of writing the
> uncommitted txns to the log file and writing currentEpoch to the currentEpoch
> file when the follower processes the NEWLEADER message. (See the code of
> Learner.java.)
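
The effect of the write order can be illustrated with a small crash-injection model. This is only a sketch under stated assumptions: it is not the code in Learner.java, the `Disk` class and `handleNewLeader` method are hypothetical, and each write is treated as atomic and durable on its own.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class NewLeaderOrderSketch {
    // Durable follower state that survives a crash (hypothetical model).
    static final class Disk {
        long currentEpoch = 1;
        final List<Long> log = new ArrayList<>();
    }

    // Thrown to simulate a crash between the two durable writes.
    static final class Crash extends RuntimeException { }

    static long zxid(long epoch, long counter) {
        return (epoch << 32) | counter;
    }

    // Apply the two NEWLEADER writes in the chosen order, optionally crashing
    // between them, and return what is left on disk.
    static Disk handleNewLeader(boolean logFirst, boolean crashBetween,
                                long newEpoch, List<Long> pending) {
        Disk disk = new Disk();
        disk.log.addAll(Arrays.asList(zxid(1, 1), zxid(1, 2), zxid(1, 3)));
        try {
            if (logFirst) {
                disk.log.addAll(pending);       // 1. log the uncommitted txns
                if (crashBetween) throw new Crash();
                disk.currentEpoch = newEpoch;   // 2. then persist currentEpoch
            } else {
                disk.currentEpoch = newEpoch;   // 1. persist currentEpoch
                if (crashBetween) throw new Crash();
                disk.log.addAll(pending);       // 2. then log the uncommitted txns
            }
        } catch (Crash c) {
            // The node went down; the disk keeps whatever was already written.
        }
        return disk;
    }

    // Invariant for this trace: a node that claims the new epoch
    // must also hold the txns it received during that sync.
    static boolean epochImpliesHistory(Disk d, long newEpoch, List<Long> pending) {
        return d.currentEpoch < newEpoch || d.log.containsAll(pending);
    }

    public static void main(String[] args) {
        List<Long> pending = Arrays.asList(zxid(1, 4));
        Disk buggy = handleNewLeader(false, true, 2, pending);
        Disk swapped = handleNewLeader(true, true, 2, pending);
        System.out.println(epochImpliesHistory(buggy, 2, pending));   // false
        System.out.println(epochImpliesHistory(swapped, 2, pending)); // true
    }
}
```

In this model the epoch-first order leaves a node with currentEpoch=2 and a log ending at <1, 3>, which is exactly node A's state at the start of Step 4. With the swapped order, a crash between the writes leaves currentEpoch=1 with <1, 4> already logged, so the node cannot win an election on epoch alone while missing that txn. Whether the swap is sufficient in the real code paths is not something this sketch can establish.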
--
This message was sent by Atlassian Jira
(v8.20.10#820010)