[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-4643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sirius updated ZOOKEEPER-4643:
------------------------------
    Summary: Committed txns may be improperly truncated if node crashes right 
after updating currentEpoch   (was: Committed txns may be improperly truncated 
when node crashes right after updating currentEpoch )

> Committed txns may be improperly truncated if node crashes right after 
> updating currentEpoch 
> ---------------------------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-4643
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4643
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: quorum, server
>    Affects Versions: 3.8.0, 3.7.1
>            Reporter: Sirius
>            Priority: Critical
>
> When a follower is processing the NEWLEADER message in SYNC phase, it will 
> update its currentEpoch to the currentEpoch file *before* writing the 
> uncommitted txns to the log file. Such order may lead to improper truncation 
> of *committed* txns in later rounds.
>  
> The critical step is to make a follower node crash right after it updates its 
> currentEpoch to the file but before writing the uncommitted txns to the log 
> file.
>  
> h2. Trace
> Here is an example to trigger the bug. (Focus on the currentEpoch and the 
> lastLoggedZxid)
> *Initial condition:*
>  - Start the ensemble with three nodes: A, B and C.
>  - Node C is elected leader.
>  - For all of them, acceptedEpoch=1, currentEpoch=1.
>  - Besides, all of them have lastLoggedZxid = <1, 3>, lastProcessedZxid = <1, 
> 3>.
> *Step 1:*
>  * Node A crashes.
> *Step 2:*
>  * A new txn is logged and committed by Node B & C. Then, B & C have 
> lastLoggedZxid = <1, 4>, lastProcessedZxid = <1, 4> ( Clients can read the 
> datatree with latest zxid <1, 4>).
> *Step 3:*
>  * Node A restarts, Node C restarts and Node B crashes.
>  * Again, C is elected leader.
>  * During the DISCOVERY phase, both A and C update their acceptedEpoch to 2.
>  * Then, during the SYNC phase, the leader C (maxCommittedLog = <1, 4>) uses 
> DIFF to sync with the follower A (lastLoggedZxid = <1, 3>), and their 
> currentEpoch will be set to 2 (and written to disk).
>  * Note that the follower A updates its currentEpoch file before writing the 
> uncommitted txns to the log file when receiving NEWLEADER message.
>  * *Unfortunately, right after the follower A finishes updating its 
> currentEpoch file, it crashes.*
> *Step 4:*
>  * Node A and B restarts and Node C crashes.
>  * Since Node A has currentEpoch=2, Node C has currentEpoch=1, Node A will be 
> elected leader.
>  * During the SYNC phase, the leader A (maxCommittedLog = <1, 3>) will use 
> TRUNC to sync with B (lastLoggedZxid = <1, 4>). Then, B removes txn <1, 4>.
>  
> However, <1, 4> was committed and visible by clients before, and is not 
> supposed to be truncated!
>  
> (Note: With careful time tuning of crash & restart, the trace can be 
> constructed with quorum nodes alive at any moment.)
>  
> The above trace has been triggered by our testing tools. I will provide more 
> materials to further demonstrate above case soon. Besides, I think the 
> affected versions might be more.
>  
> h2. Analysis
> The critical step here is to interrupt two updates that should be done 
> together. In the Zab paper, a follower updates its current epoch and history 
> during the SYNC phase in an *atomic* action, and the correctness of Zab is 
> based on that. However, in actual environment, a node crash (or other 
> environment failures) may occur at any time, breaking the procedures that 
> will not be interrupted at the protocol level. This is the gap between the 
> theory and the reality. As the system evolves, the implementation of the SYNC 
> phase becomes much more complicated compared to the original Zab protocol. We 
> think it important to keep ZooKeeper still in a correct condition even under 
> such type of environment circumstances.
>  
> h2. Possible Fix
> Intuitively, this issue can be avoided by exchanging the order of writing 
> uncommitted txns to the log file and writing currentEpoch to the currentEpoch 
> file when the follower is processing NEWLEADER message. (See the code of 
> Learner.java)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to