[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-4646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sirius updated ZOOKEEPER-4646:
------------------------------
    Description: 
When a follower processes the NEWLEADER message in the SYNC phase, it calls 
{{logRequest(..)}} to submit the txn persistence task to the SyncThread (i.e., 
the {{SyncRequestProcessor}} thread). The SyncThread persists txns 
asynchronously and does not guarantee to finish the task before the follower 
replies ACK-LD (i.e., the ACK of NEWLEADER) to the leader, which may lead to 
committed data loss.
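
The following minimal Java sketch (illustrative names only, not the actual 
ZooKeeper classes) shows this asynchronous hand-off: the ACK-LD can leave the 
follower while the txn still sits in an in-memory queue.
{code:java}
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Illustrative follower: the NEWLEADER handler enqueues txns and replies
// ACK-LD, while a background thread persists them later.
public class AsyncAckSketch {

    private final BlockingQueue<String> pendingTxns = new LinkedBlockingQueue<>();

    // Stand-in for the SyncThread: drains the queue and fsyncs to disk.
    private final Thread syncThread = new Thread(() -> {
        try {
            while (true) {
                String txn = pendingTxns.take();
                persistToDisk(txn); // the write + fsync happen here, later
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    });

    public AsyncAckSketch() {
        syncThread.setDaemon(true);
        syncThread.start();
    }

    // Stand-in for the NEWLEADER handling path.
    void onNewLeader() throws InterruptedException {
        pendingTxns.put("txn <1,4>"); // logRequest(..): enqueue only
        sendAckLd();                  // ACK-LD is sent immediately...
        // ...so a crash at this point loses the txn even though the
        // leader now believes this follower holds its initial history.
    }

    private void persistToDisk(String txn) { /* write + fsync */ }

    private void sendAckLd() { /* send ACK of NEWLEADER to the leader */ }
}
{code}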

This problem was first raised in ZOOKEEPER-3911. However, the fix for 
ZOOKEEPER-3911 does not solve the problem at its root: the trace below can 
still be generated in the latest versions today.
h2. Trace

[^Trace-ZK-4646.pdf]

The trace is basically the same as the one in ZOOKEEPER-3911 (see the first 
comment provided by [~hanm] in that issue). For convenience, we use the zxid to 
represent a txn here.
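
(As background: the <epoch, counter> pair is how a 64-bit zxid decomposes, with 
the epoch in the high 32 bits and the counter in the low 32 bits; cf. 
{{ZxidUtils}}. A small illustration:)
{code:java}
// Mapping the <epoch, counter> notation onto a 64-bit zxid.
public class ZxidNotation {
    static long makeZxid(long epoch, long counter) {
        return (epoch << 32) | (counter & 0xffffffffL);
    }

    public static void main(String[] args) {
        long zxid = makeZxid(1, 4); // the txn <1, 4> from the trace below
        System.out.printf("zxid=0x%x epoch=%d counter=%d%n",
                zxid, zxid >> 32, zxid & 0xffffffffL);
    }
}
{code}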

Start the ensemble with three nodes: +S0+, +S1+ & +S2+.
 - +S2+ is elected leader.
 - All of them have the same log with the last zxid <1, 3>.
 - +S2+ logs a new txn <1, 4> and broadcasts it.
 - +S0+ & +S1+ crash before they receive the proposal of <1, 4>.
 - +S0+ & +S1+ restart.
 - +S2+ is elected leader again. 
 - +S0+ & +S1+ DIFF sync with +S2+.
 - +S0+ & +S1+ send ACK-LD to +S2+ *before* their SyncThreads log the txns to 
disk. (This is possible because txn logging is processed asynchronously!)
 - Verify that clients of +S2+ have the view of <1, 4>.
 - The followers +S0+ & +S1+ crash *before* their SyncThreads persist the txns 
to disk. (This is extremely timing-sensitive but possible!)
 - +S0+ & +S1+ restart, and +S2+ crashes.
 - Verify clients of +S0+ & +S1+ do NOT have the view of <1, 4>, a violation of 
ZAB.

 

Extra note: the trace can be constructed while a quorum of nodes stays alive at 
every moment, given careful timing of node shutdown & restart, e.g., letting 
+S0+ & +S1+ shut down and restart one after another within a short time.
h2. Analysis

*Root Cause:*

The root cause lies in asynchronous execution across multiple threads.

When a follower replies ACK-LD, it should guarantee that it has already logged 
the leader's initial history (according to ZAB). However, txn logging is 
executed asynchronously by the SyncThread, so this guarantee can be violated. 
The leader may receive the ACK-LD, believe the responding follower is in sync, 
and enter the BROADCAST phase while the follower's history is in fact not yet 
in sync. At that point, environment failures can prevent the follower from 
ever logging the txns successfully. If that node, holding a stale or 
incomplete committed history, is later elected leader, it may lose txns that 
were already committed and applied on the former leader.

The implementation adopts multi-threading as a performance optimization. 
However, this can introduce subtle bugs that cannot occur at the protocol 
level. The fix for ZOOKEEPER-3911 simply calls {{logRequest(..)}} to submit 
the logging requests to the SyncRequestProcessor's queue before replying 
ACK-LD inside the NEWLEADER processing logic, without further guarding against 
the risks of asynchronous execution. When the follower replies ACK-LD and then 
crashes before its SyncThread writes the txns to disk, the problem is 
triggered.

*Property Violation:*

From the server side, the committed log of the ensemble does not grow 
monotonically; different nodes end up with inconsistent committed logs. From 
the client side, clients connected to different nodes may observe inconsistent 
views: a client may read stale data after having already obtained a newer 
version. That newer version can be obtained only from some nodes of the 
ensemble rather than all of them and, worse, may later be removed entirely.
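
A hypothetical client-side read sequence showing the symptom (hostnames and 
the znode path are invented for illustration; the API calls are standard 
ZooKeeper client calls):
{code:java}
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class StaleReadDemo {
    public static void main(String[] args) throws Exception {
        // "Verify that clients of S2 have the view of <1, 4>":
        ZooKeeper viaS2 = new ZooKeeper("s2:2181", 30000, event -> {});
        Stat stat = new Stat();
        viaS2.getData("/app/config", false, stat);
        System.out.println("mzxid via S2: 0x" + Long.toHexString(stat.getMzxid()));

        // After S0 & S1 restart without the txn and S2 crashes, the same
        // read against the surviving nodes returns the older state:
        ZooKeeper viaS01 = new ZooKeeper("s0:2181,s1:2181", 30000, event -> {});
        viaS01.getData("/app/config", false, stat);
        System.out.println("mzxid via S0/S1: 0x" + Long.toHexString(stat.getMzxid()));
        // The second mzxid is older than the first: a committed write vanished.
    }
}
{code}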

Although ZOOKEEPER-4643 has similar symptoms and property violations, it should 
be regarded as a distinct problem, because its root cause and risk pattern 
differ from this one's. More specifically:
 * ZOOKEEPER-4643: the risk lies in updating currentEpoch before logging txns 
to disk. The bug can be triggered by interrupting txn logging after 
currentEpoch has been updated.
 * ZOOKEEPER-4646: the risk lies in replying ACK-LD before logging txns to 
disk. The bug can be triggered by interrupting txn logging after ACK-LD has 
been replied.

*Affected Versions:*

Our testing tools have generated the above trace in multiple versions, 
including 3.7.1 & 3.8.0 (the latest stable & current versions at the time of 
writing). More versions are likely affected, since the critical ordering 
between the follower's replying ACK-LD and persisting its history during SYNC 
stays non-deterministic across multiple versions.

 
h2. Possible Fix

Considering this issue together with ZOOKEEPER-4643, one possible fix is to 
guarantee that the following partial orders are satisfied (a sketch follows 
the list):
 * A follower updates its {{_currentEpoch_}} only after it has persisted the 
txns that might be applied to the leader's datatree before the leader enters 
the BROADCAST phase (so as to avoid ZOOKEEPER-4643).
 * A follower replies ACK-LD only after it has persisted the txns that might 
be applied to the leader's datatree before the leader enters the BROADCAST 
phase (so as to avoid this issue).
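
The sketch below shows one way to realize both orders, assuming the follower 
can block until its queued txns become durable. All names are illustrative; 
this is not an actual patch.
{code:java}
import java.util.concurrent.CountDownLatch;

// Illustrative follower-side barrier: nothing here is real ZooKeeper code.
public class SyncBarrierSketch {

    private final CountDownLatch diffTxnsDurable = new CountDownLatch(1);

    // Called by the sync thread once the last txn queued during the
    // DIFF sync has been written and fsync'ed.
    void onDiffTxnsPersisted() {
        diffTxnsDurable.countDown();
    }

    // The NEWLEADER handling path blocks on the durability barrier
    // before making any promise to the leader.
    void onNewLeader(long newEpoch) throws InterruptedException {
        // The txns from the DIFF sync have already been queued for logging.
        diffTxnsDurable.await();      // wait until they are on disk
        updateCurrentEpoch(newEpoch); // safe now: avoids ZOOKEEPER-4643
        sendAckLd();                  // safe now: avoids this issue
    }

    private void updateCurrentEpoch(long epoch) { /* write the epoch file */ }

    private void sendAckLd() { /* reply ACK of NEWLEADER */ }
}
{code}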


> Committed txns may still be lost if followers crash after replying ACK-LD but 
> before writing txns to disk
> ---------------------------------------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-4646
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4646
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: quorum, server
>    Affects Versions: 3.6.3, 3.7.0, 3.8.0, 3.7.1, 3.8.1
>            Reporter: Sirius
>            Priority: Critical
>         Attachments: Trace-ZK-4646.pdf
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
