[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-4643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sirius updated ZOOKEEPER-4643:
------------------------------
    Description: 
When a follower is processing the NEWLEADER message in SYNC phase, it will 
update its currentEpoch to the file *before* writing the uncommitted txns to 
the log file. Such order may lead to improper truncation of *committed* txns in 
later rounds.

 

The critical step is to make a follower node crash right after it updates its 
currentEpoch to the file but before writing the uncommitted txns to the log 
file.

 
h2. Trace

Here is an example to trigger the bug. (Focus on _currentEpoch_ and 
{_}lastLoggedZxid{_})

{*}Round 1 (Running nodes updating their acceptedEpoch & currentEpoch to 
1{*}{*}):{*}
 - Start the ensemble with three nodes: A, B and C.
 - Node C is elected leader.
 - For all of them, _acceptedEpoch_ = 1, _currentEpoch_ = 1.
 - Besides, all of them have _lastLoggedZxid_ = <1, 3>, _lastProcessedZxid_ = 
<1, 3>.
 - Node A crashes.
 - A new txn is logged and committed by Node B & C. Then, B & C have 
_lastLoggedZxid_ = <1, 4>, _lastProcessedZxid_ = <1, 4> ( Clients can read the 
datatree with latest zxid <1, 4>).

*Round 2:*
 * Node A restarts, Node C restarts and Node B crashes.
 * Again, C is elected leader.
 * During the DISCOVERY phase, both A and C update their acceptedEpoch to 2.
 * Then, during the SYNC phase, the leader C (maxCommittedLog = <1, 4>) uses 
DIFF to sync with the follower A (lastLoggedZxid = <1, 3>), and their 
currentEpoch will be set to 2 (and written to disk).
 * Note that the follower A updates its currentEpoch file before writing the 
uncommitted txns to the log file when receiving NEWLEADER message.
 * *Unfortunately, right after the follower A finishes updating its 
currentEpoch file, it crashes.*

*Round 3:*
 * Node A and B restarts and Node C crashes.
 * Since Node A has currentEpoch=2, Node C has currentEpoch=1, Node A will be 
elected leader.
 * During the SYNC phase, the leader A (maxCommittedLog = <1, 3>) will use 
TRUNC to sync with B (lastLoggedZxid = <1, 4>). Then, B removes txn <1, 4>.

 

However, <1, 4> was committed and visible by clients before, and is not 
supposed to be truncated!

 

(Note: With careful time tuning of crash & restart, the trace can be 
constructed with quorum nodes alive at any moment.)

 

The above trace has been triggered by our testing tools. I will provide more 
materials to further demonstrate above case soon. Besides, I think the affected 
versions might be more.

 
h2. Analysis

The critical step here is to interrupt two updates that should be done 
together. In the Zab paper, a follower updates its current epoch and history 
during the SYNC phase in an *atomic* action, and the correctness of Zab is 
based on that. However, in actual environment, a node crash (or other 
environment failures) may occur at any time, breaking the procedures that will 
not be interrupted at the protocol level. This is the gap between the theory 
and the reality. As the system evolves, the implementation of the SYNC phase 
becomes much more complicated compared to the original Zab protocol. We think 
it important to keep ZooKeeper still in a correct condition even under such 
type of environment circumstances.

 
h2. Possible Fix

Intuitively, this issue can be avoided by exchanging the order of writing 
uncommitted txns to the log file and writing currentEpoch to the currentEpoch 
file when the follower is processing NEWLEADER message. (See the code of 
Learner.java)

  was:
When a follower is processing the NEWLEADER message in SYNC phase, it will 
update its currentEpoch to the currentEpoch file *before* writing the 
uncommitted txns to the log file. Such order may lead to improper truncation of 
*committed* txns in later rounds.

 

The critical step is to make a follower node crash right after it updates its 
currentEpoch to the file but before writing the uncommitted txns to the log 
file.

 
h2. Trace

Here is an example to trigger the bug. (Focus on the currentEpoch and the 
lastLoggedZxid)

*Initial condition:*
 - Start the ensemble with three nodes: A, B and C.
 - Node C is elected leader.
 - For all of them, acceptedEpoch=1, currentEpoch=1.
 - Besides, all of them have lastLoggedZxid = <1, 3>, lastProcessedZxid = <1, 
3>.

*Step 1:*
 * Node A crashes.

*Step 2:*
 * A new txn is logged and committed by Node B & C. Then, B & C have 
lastLoggedZxid = <1, 4>, lastProcessedZxid = <1, 4> ( Clients can read the 
datatree with latest zxid <1, 4>).

*Step 3:*
 * Node A restarts, Node C restarts and Node B crashes.
 * Again, C is elected leader.
 * During the DISCOVERY phase, both A and C update their acceptedEpoch to 2.
 * Then, during the SYNC phase, the leader C (maxCommittedLog = <1, 4>) uses 
DIFF to sync with the follower A (lastLoggedZxid = <1, 3>), and their 
currentEpoch will be set to 2 (and written to disk).
 * Note that the follower A updates its currentEpoch file before writing the 
uncommitted txns to the log file when receiving NEWLEADER message.
 * *Unfortunately, right after the follower A finishes updating its 
currentEpoch file, it crashes.*

*Step 4:*
 * Node A and B restarts and Node C crashes.
 * Since Node A has currentEpoch=2, Node C has currentEpoch=1, Node A will be 
elected leader.
 * During the SYNC phase, the leader A (maxCommittedLog = <1, 3>) will use 
TRUNC to sync with B (lastLoggedZxid = <1, 4>). Then, B removes txn <1, 4>.

 

However, <1, 4> was committed and visible by clients before, and is not 
supposed to be truncated!

 

(Note: With careful time tuning of crash & restart, the trace can be 
constructed with quorum nodes alive at any moment.)

 

The above trace has been triggered by our testing tools. I will provide more 
materials to further demonstrate above case soon. Besides, I think the affected 
versions might be more.

 
h2. Analysis

The critical step here is to interrupt two updates that should be done 
together. In the Zab paper, a follower updates its current epoch and history 
during the SYNC phase in an *atomic* action, and the correctness of Zab is 
based on that. However, in actual environment, a node crash (or other 
environment failures) may occur at any time, breaking the procedures that will 
not be interrupted at the protocol level. This is the gap between the theory 
and the reality. As the system evolves, the implementation of the SYNC phase 
becomes much more complicated compared to the original Zab protocol. We think 
it important to keep ZooKeeper still in a correct condition even under such 
type of environment circumstances.

 
h2. Possible Fix

Intuitively, this issue can be avoided by exchanging the order of writing 
uncommitted txns to the log file and writing currentEpoch to the currentEpoch 
file when the follower is processing NEWLEADER message. (See the code of 
Learner.java)


> Committed txns may be improperly truncated if node crashes right after 
> updating currentEpoch 
> ---------------------------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-4643
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4643
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: quorum, server
>    Affects Versions: 3.8.0, 3.7.1
>            Reporter: Sirius
>            Priority: Critical
>
> When a follower is processing the NEWLEADER message in SYNC phase, it will 
> update its currentEpoch to the file *before* writing the uncommitted txns to 
> the log file. Such order may lead to improper truncation of *committed* txns 
> in later rounds.
>  
> The critical step is to make a follower node crash right after it updates its 
> currentEpoch to the file but before writing the uncommitted txns to the log 
> file.
>  
> h2. Trace
> Here is an example to trigger the bug. (Focus on _currentEpoch_ and 
> {_}lastLoggedZxid{_})
> {*}Round 1 (Running nodes updating their acceptedEpoch & currentEpoch to 
> 1{*}{*}):{*}
>  - Start the ensemble with three nodes: A, B and C.
>  - Node C is elected leader.
>  - For all of them, _acceptedEpoch_ = 1, _currentEpoch_ = 1.
>  - Besides, all of them have _lastLoggedZxid_ = <1, 3>, _lastProcessedZxid_ = 
> <1, 3>.
>  - Node A crashes.
>  - A new txn is logged and committed by Node B & C. Then, B & C have 
> _lastLoggedZxid_ = <1, 4>, _lastProcessedZxid_ = <1, 4> ( Clients can read 
> the datatree with latest zxid <1, 4>).
> *Round 2:*
>  * Node A restarts, Node C restarts and Node B crashes.
>  * Again, C is elected leader.
>  * During the DISCOVERY phase, both A and C update their acceptedEpoch to 2.
>  * Then, during the SYNC phase, the leader C (maxCommittedLog = <1, 4>) uses 
> DIFF to sync with the follower A (lastLoggedZxid = <1, 3>), and their 
> currentEpoch will be set to 2 (and written to disk).
>  * Note that the follower A updates its currentEpoch file before writing the 
> uncommitted txns to the log file when receiving NEWLEADER message.
>  * *Unfortunately, right after the follower A finishes updating its 
> currentEpoch file, it crashes.*
> *Round 3:*
>  * Node A and B restarts and Node C crashes.
>  * Since Node A has currentEpoch=2, Node C has currentEpoch=1, Node A will be 
> elected leader.
>  * During the SYNC phase, the leader A (maxCommittedLog = <1, 3>) will use 
> TRUNC to sync with B (lastLoggedZxid = <1, 4>). Then, B removes txn <1, 4>.
>  
> However, <1, 4> was committed and visible by clients before, and is not 
> supposed to be truncated!
>  
> (Note: With careful time tuning of crash & restart, the trace can be 
> constructed with quorum nodes alive at any moment.)
>  
> The above trace has been triggered by our testing tools. I will provide more 
> materials to further demonstrate above case soon. Besides, I think the 
> affected versions might be more.
>  
> h2. Analysis
> The critical step here is to interrupt two updates that should be done 
> together. In the Zab paper, a follower updates its current epoch and history 
> during the SYNC phase in an *atomic* action, and the correctness of Zab is 
> based on that. However, in actual environment, a node crash (or other 
> environment failures) may occur at any time, breaking the procedures that 
> will not be interrupted at the protocol level. This is the gap between the 
> theory and the reality. As the system evolves, the implementation of the SYNC 
> phase becomes much more complicated compared to the original Zab protocol. We 
> think it important to keep ZooKeeper still in a correct condition even under 
> such type of environment circumstances.
>  
> h2. Possible Fix
> Intuitively, this issue can be avoided by exchanging the order of writing 
> uncommitted txns to the log file and writing currentEpoch to the currentEpoch 
> file when the follower is processing NEWLEADER message. (See the code of 
> Learner.java)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to