[jira] [Updated] (ZOOKEEPER-4646) Committed txns may still be lost if followers crash after replying ACK-LD but before writing txns to disk

Sirius (Jira) Mon, 12 Dec 2022 20:27:25 -0800


     [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-4646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Sirius updated ZOOKEEPER-4646:
------------------------------
    Summary: Committed txns may still be lost if followers crash after replying 
ACK-LD but before writing txns to disk  (was: Committed txns may still be lost 
if followers crashes after replying ACK-LD but before writing txns to disk)

> Committed txns may still be lost if followers crash after replying ACK-LD but 
> before writing txns to disk
> ---------------------------------------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-4646
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4646
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: quorum, server
>    Affects Versions: 3.6.3, 3.7.0, 3.8.0, 3.7.1
>            Reporter: Sirius
>            Priority: Critical
>         Attachments: Trace-ZK-4646.pdf
>
>
> When a follower is processing the NEWLEADER message in SYNC phase, it will 
> call {{logRequest(..)}} to submit the txn persistence task to the 
> SyncRequestProcessor thread. The SyncRequestProcessor thread will persist 
> txns asynchronously and does not promise to finish the task before the 
> follower replies ACK-LD (i.e. ACK of NEWLEADER) to the leader, which may lead 
> to committed data loss.
> Actually, this problem had been first raised in ZOOKEEPER-3911 . However, the 
> fix of  ZOOKEEPER-3911  does not solve the problem at the root. The following 
> trace can still be generated in the latest version nowadays.
> h2. Trace
> [^Trace-ZK-4646.pdf]
> The trace is basically the same as the one in ZOOKEEPER-3911 (See the first 
> comment provided by [~hanm] in that issue).  For convenience we use the zxid 
> to represent a txn here.
> Start the ensemble with three nodes: S{+}0{+}, +S1+ & {+}S2{+}.
>  - +S2+ is elected leader.
>  - All of them have the same log with the last zxid <1, 3>.
>  - +S2+ logs a new txn <1, 4> and makes a broadcast.
>  - +S0+ & +S1+ crash before they receive the proposal of <1, 4>.
>  - +S0+ & +S1+ restart.
>  - +S2+ is elected leader again. 
>  - +S0+ & +S1+ DIFF sync with +S2+ .
>  - +S0+ & +S1+ send ACK-LD to +S2+ *before* their SyncRequestProcessor 
> threads log txns to disk. (This is possible because txn logging is processed 
> asynchronously! )
>  - Verify clients of +S2+ have the view of <1, 4>.
>  - The followers +S0+ & +S1+ crash *before* their SyncRequestProcessor 
> threads persist txns to disk. (This is extremely timing sensitive but 
> possible! )
>  - +S0+ & +S1+ restart, and +S2+ crashes.
>  - Verify clients of +S0+ & +S1+ do NOT have the view of <1, 4>, a violation 
> of ZAB.
>  
> Extra note: The trace can be constructed with quorum nodes alive at any 
> moment with careful time tuning of node shutdown & restart, e.g., let +S0+ & 
> +S1+ shutdown and restart one by one in a short time.
> h2. Analysis
> *Root Cause:*
> The root cause lies in the asynchronous executions by multi-threads.
> When a follower replies ACK-LD, it should promise that it has already logged 
> the initial history of the leader (according to ZAB). However, txn logging is 
> executed by the SyncRequestProcessor thread asynchronously, so the above 
> promise can be violated. It is possible that, after the leader receives 
> ACK-LD, believing that the responding follower has been in sync, and then 
> gets into the BROADCAST phase, while in fact the history of the follower is 
> not in sync yet. At this time, environment failures might prevent the 
> follower from logging successfully. When that node with stale or incomplete 
> committed history is elected leader later, it might lose txns that have been 
> committed and applied on the former leader node.
> The implementation adopts the multi-threading style for performance 
> optimization. However, it may bring some underlying subtle bugs that will not 
> occur at the protocol level. The fix of ZOOKEEPER-3911 simply calls 
> {{logRequest(..)}} to submit the logging requests to SyncRequestProcessor's 
> queue before replying ACK-LD inside the NEWLEADER processing logic, without 
> further considering the risk of asynchronous executions by multi-threads. 
> When the follower replies ACK-LD and then crashes before its 
> SyncRequestProcessor thread writes txns to disk, the problem is triggered.
> *Property Violation:*
> From the server side, the committed log of the ensemble does not append 
> monotonically; different nodes have inconsistent committed logs. From the 
> client side, clients connected to different nodes may have inconsistent 
> views. A client may read stale data after a newer version is obtained. That 
> newer version can only be obtained from certain nodes of the ensemble rather 
> than all nodes. What's worse, that newer version may also be removed later.
> Although ZOOKEEPER-4643 has similar symptoms and property violations, it 
> should be regarded as a distinct problem because it has different root cause 
> and risk pattern compared to this one. More specifically,
>  * ZOOKEEPER-4643 : the risk lies in the order of updating currentEpoch 
> before logging txns to disk. The bug can be triggered by interrupting the 
> action of logging txns after currentEpoch is updated. 
>  * ZOOKEEPER-4646 : the risk lies in the order of replying ACK-LD before 
> logging txns to disk. The bug can be triggered by interrupting the action of 
> logging txns after ACK-LD is replied. 
> *Affected Versions:*
> The above trace has been generated in multiple versions such as 3.7.1 & 3.8.0 
> (the latest stable & current version till now) by our testing tools. The 
> affected versions might be more, since the critical update order between the 
> follower's replying ACK-LD and updating its history during SYNC stay 
> non-deterministic in multiple versions.
>  
> h2. Possible Fix
> Considering this issue and ZOOKEEPER-4643 , one possible fix is to guarantee 
> the following partial orders to be satisfied:
>  * A follower updates its {{_currentEpoch_}} only after it has persisted the 
> txns that might be applied to the leader's datatree before the leader gets 
> into the BROADCAST phase (so as to avoid the issue of ZOOKEEPER-4643 ). 
>  * A follower replies ACK-LD only after it has persisted the txns that might 
> be applied to the leader's datatree before the leader gets into the BROADCAST 
> phase  (so as to avoid this issue).
> Intuitively, the {{CompletableFuture}} class introduced in Java 8 may be a 
> suitable choice to fix the issue. With the method of {{complete(..)}} and 
> {{whenComplete(..)}} provided by {{{}CompletableFuture{}}}, the 
> SyncRequestProcessor can explicitly set the txn logging process completed and 
> trigger the process of updating {{_currenEpoch_}} and replying ACK-LD.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Updated] (ZOOKEEPER-4646) Committed txns may still be lost if followers crash after replying ACK-LD but before writing txns to disk

Reply via email to