[
https://issues.apache.org/jira/browse/ZOOKEEPER-4646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sirius updated ZOOKEEPER-4646:
------------------------------
Summary: Committed txns may still be lost if followers crash after replying
ACK of NEWLEADER but before writing txns to disk (was: Committed txns may
still be lost if followers crash after replying ACK-LD but before writing txns
to disk)
> Committed txns may still be lost if followers crash after replying ACK of
> NEWLEADER but before writing txns to disk
> -------------------------------------------------------------------------------------------------------------------
>
> Key: ZOOKEEPER-4646
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4646
> Project: ZooKeeper
> Issue Type: Bug
> Components: quorum, server
> Affects Versions: 3.6.3, 3.7.0, 3.8.0, 3.7.1, 3.8.1
> Reporter: Sirius
> Priority: Critical
> Attachments: Trace-ZK-4646.pdf
>
>
> When a follower is processing the NEWLEADER message in SYNC phase, its
> QuorumPeer thread will call {{logRequest(..)}} to submit the txn persistence
> task to the SyncThread. The SyncThread will persist txns asynchronously and
> does not promise to finish the task before the follower replies ACK-LD (i.e.
> ACK of NEWLEADER) to the leader, which may lead to committed data loss.
> Actually, this problem had been first raised in ZOOKEEPER-3911 . However, the
> fix of ZOOKEEPER-3911 does not solve the problem at the root. The following
> trace can still be generated in the latest version nowadays.
>
> h2. Trace
> [^Trace-ZK-4646.pdf]
> The trace is basically the same as the one in ZOOKEEPER-3911 (See the first
> comment provided by [~hanm] in that issue). For convenience we use the zxid
> to represent a txn here.
> Start the ensemble with three nodes: S{+}0{+}, +S1+ & {+}S2{+}.
> - +S2+ is elected leader.
> - All of them have the same log with the last zxid <1, 3>.
> - +S2+ logs a new txn <1, 4> and makes a broadcast.
> - +S0+ & +S1+ crash before they receive the proposal of <1, 4>.
> - +S0+ & +S1+ restart.
> - +S2+ is elected leader again.
> - +S0+ & +S1+ DIFF sync with +S2+ .
> - +S0+ & +S1+ send ACK-LD to +S2+ *before* their SyncThreads log txns to
> disk. (This is possible because txn logging is processed asynchronously! )
> - Verify clients of +S2+ have the view of <1, 4>.
> - The followers +S0+ & +S1+ crash *before* their SyncThreads persist txns to
> disk. (This is extremely timing sensitive but possible! )
> - +S0+ & +S1+ restart, and +S2+ crashes.
> - Verify clients of +S0+ & +S1+ do NOT have the view of <1, 4>, a violation
> of ZAB.
>
> Extra note: The trace can be constructed with quorum nodes alive at any
> moment with careful time tuning of node shutdown & restart, e.g., let +S0+ &
> +S1+ shutdown and restart one by one in a short time.
>
> h2. Analysis
> *Root Cause:*
> The root cause lies in the asynchronous executions by multi-threads.
> When a follower replies ACK-LD, it should promise that it has already logged
> the initial history of the leader (according to ZAB). However, txn logging is
> executed by the SyncThread asynchronously, so the above promise can be
> violated. It is possible that, after the leader receives ACK-LD, believing
> that the responding follower has been in sync, and then gets into the
> BROADCAST phase, while in fact the history of the follower is not in sync
> yet. At this time, environment failures might prevent the follower from
> logging successfully. When that node with stale or incomplete committed
> history is elected leader later, it might lose txns that have been committed
> and applied on the former leader node.
> The implementation adopts the multi-threading style for performance
> optimization. However, it may bring some underlying subtle bugs that will not
> occur at the protocol level. The fix of ZOOKEEPER-3911 simply calls
> {{logRequest(..)}} to submit the logging requests to SyncRequestProcessor's
> queue before replying ACK-LD inside the NEWLEADER processing logic, without
> further considering the risk of asynchronous executions by multi-threads.
> When the follower replies ACK-LD and then crashes before its SyncThread
> writes txns to disk, the problem is triggered.
>
> *Property Violation:*
> From the server side, the committed log of the ensemble does not append
> monotonically; different nodes have inconsistent committed logs. From the
> client side, clients connected to different nodes may have inconsistent
> views. A client may read stale data after a newer version is obtained. That
> newer version can only be obtained from certain nodes of the ensemble rather
> than all nodes. What's worse, that newer version may also be removed later.
>
> *Affected Versions:*
> The above trace has been generated in multiple versions such as 3.7.1 & 3.8.1
> (the latest stable & current version till now) by our testing tools. The
> affected versions might be more, since the critical partial order between the
> follower's replying ACK-LD and updating its history during SYNC stay
> non-deterministic in multiple versions.
>
> h2. Possible Fix
> Considering this issue and ZOOKEEPER-4685 , one possible fix is to guarantee
> the following partial orders to be satisfied:
> * A follower replies ACK-LD only after it has persisted the txns that might
> be applied to the leader's datatree before the leader gets into the BROADCAST
> phase (so as to avoid this issue).
> * The follower replies ACK of PROPOSAL only after it replies ACK-LD (i.e.
> ACK of NEWLEADER) to the leader (so as to avoid ZOOKEEPER-4685 ).
>
> We will fix this bug soon.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)