[jira] [Updated] (ZOOKEEPER-4685) Unnecessary system unavailability due to Leader shutdown when follower sent ACK of PROPOSAL before sending ACK of NEWLEADER in log recovery

Sirius (Jira) Fri, 24 Mar 2023 21:01:28 -0700


     [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-4685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Sirius updated ZOOKEEPER-4685:
------------------------------
    Description: 
When a follower processes the NEWLEADER message in the SYNC phase, its 
QuorumPeer thread calls {{logRequest(..)}} to submit the txn persistence 
task to the SyncThread. The SyncThread may persist the txns and reply with 
their ACKs before the QuorumPeer thread replies with ACK-LD (i.e. ACK of 
NEWLEADER) to the leader. As a result, the leader may fail to collect ACK-LDs 
from a quorum, shut down, and trigger a new round of election. This 
introduces unnecessary recovery work, delays the servers' entry into the 
BROADCAST phase, and significantly reduces the service's availability.
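
The hand-off described above can be sketched in isolation. The following is 
only an illustration, not ZooKeeper code: the class {{AsyncLoggingRace}}, the 
method {{run(..)}} and the string labels are hypothetical, and the latches 
merely force a schedule that the real thread scheduler is free to pick on its 
own.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical stand-in for the follower's two threads; not ZooKeeper code.
class AsyncLoggingRace {

    /** Returns the order in which the two replies are written out. */
    static List<String> run(boolean syncThreadFirst) {
        List<String> wire = new CopyOnWriteArrayList<>();
        CountDownLatch txnAcked = new CountDownLatch(1);
        CountDownLatch ackLdSent = new CountDownLatch(1);
        ExecutorService syncThread = Executors.newSingleThreadExecutor();
        try {
            // "QuorumPeer thread", while handling NEWLEADER: hand the txn
            // to the "SyncThread", which persists it and replies on its own.
            syncThread.execute(() -> {
                try {
                    if (!syncThreadFirst) {
                        ackLdSent.await(); // schedule where ACK-LD wins
                    }
                    wire.add("ACK <1, 1>");
                    txnAcked.countDown();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            if (syncThreadFirst) {
                txnAcked.await(); // schedule where the SyncThread wins
            }
            wire.add("ACK-LD");
            ackLdSent.countDown();
            txnAcked.await(); // wait until both replies are recorded
            return new ArrayList<>(wire);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            syncThread.shutdownNow();
        }
    }
}
```

Calling {{AsyncLoggingRace.run(true)}} yields the problematic order (ACK of 
the txn first, then ACK-LD); {{run(false)}} yields the order the leader 
expects. Nothing in the submit-then-reply structure itself rules out the 
first case.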

The following trace reproduces the problem on the latest version.

 
h2. Trace

Start the ensemble with three nodes: +S0+, +S1+ and +S2+.
 - +S2+ is elected leader.
 - +S2+ logs a new txn <1, 1> and broadcasts it.
 - +S0+ restarts and +S1+ crashes before either receives the proposal of 
<1, 1>.
 - +S2+ is elected leader again. 
 - +S2+ syncs with +S0+ using DIFF, and sends the proposal of <1, 1> during 
SYNC.
 - After +S0+ receives NEWLEADER, +S0+'s SyncThread may persist the txn 
<1, 1> and reply with the corresponding ACK to the leader +S2+ before +S0+'s 
QuorumPeer thread replies with ACK-LD. (This is possible because txn logging 
is processed asynchronously by the SyncThread.)
 - The corresponding LearnerHandler on +S2+ expects ACK-LD first and does not 
recognize an ACK of a proposal that arrives before it, so it blocks at 
_waitForStartup()_ until the leader turns its state to _state.RUNNING_.
 - However, the QuorumPeer of the leader +S2+ does not receive ACK-LDs from a 
quorum before the timeout, and throws _InterruptedException_ during 
_waitForNewLeaderAck(..)_.
 - The leader then shuts down and a new round of election starts, which 
delays quorum establishment and significantly reduces availability.
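
The leader-side effect in the trace can be illustrated with a toy model of 
ACK-LD accounting. This is a sketch under stated assumptions, not the actual 
Leader or LearnerHandler logic: the class {{AckLdModel}} and its methods are 
hypothetical; it assumes that the new leadership runs in epoch 2, so that 
NEWLEADER carries zxid <2, 0>, that a zxid packs the epoch into the high 32 
bits, and that the handler treats the first ACK after NEWLEADER as the 
ACK-LD slot and counts it only if the zxid matches.

```java
import java.util.HashSet;
import java.util.Set;

// Toy model of the leader's ACK-LD bookkeeping; not ZooKeeper's code.
class AckLdModel {
    private final long newLeaderZxid;
    private final int ensembleSize;
    private final Set<Long> ackLdSet = new HashSet<>();

    AckLdModel(long leaderSid, long newLeaderZxid, int ensembleSize) {
        this.newLeaderZxid = newLeaderZxid;
        this.ensembleSize = ensembleSize;
        ackLdSet.add(leaderSid); // the leader counts itself
    }

    /** Assumed zxid layout: epoch in the high 32 bits, counter in the low. */
    static long zxid(long epoch, long counter) {
        return (epoch << 32) | counter;
    }

    /** Assumption: only an ACK matching the NEWLEADER zxid is counted. */
    void onFirstAckAfterNewLeader(long followerSid, long ackedZxid) {
        if (ackedZxid == newLeaderZxid) {
            ackLdSet.add(followerSid);
        }
        // Otherwise the ACK is not counted as an ACK-LD.
    }

    boolean hasQuorum() {
        return ackLdSet.size() > ensembleSize / 2;
    }

    /** Replays the trace: S2 leads, S1 is down, only S0 can reply. */
    static boolean leaderStarts(long firstAckZxidFromS0) {
        AckLdModel leader = new AckLdModel(2L, zxid(2, 0), 3);
        leader.onFirstAckAfterNewLeader(0L, firstAckZxidFromS0);
        return leader.hasQuorum();
    }
}
```

In this model, {{leaderStarts(zxid(2, 0))}} reaches a quorum of ACK-LDs, 
while {{leaderStarts(zxid(1, 1))}} (the ACK of the proposal arriving first) 
does not, which corresponds to the timeout in _waitForNewLeaderAck(..)_ 
described above.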

 
h2. Possible Fix

Considering this issue together with ZOOKEEPER-4646, one possible fix is to 
guarantee the following partial orders:
 * The follower replies with the ACK of a PROPOSAL only after it has replied 
with ACK-LD (i.e. ACK of NEWLEADER) to the leader (to avoid this issue).
 * The follower replies with ACK-LD only after it has persisted the txns that 
might be applied to the leader's datatree before the leader enters the 
BROADCAST phase (to avoid ZOOKEEPER-4646).
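
One way to picture the two partial orders is a small gate on the follower's 
outgoing replies. This is a minimal sketch of the idea only, not a proposed 
patch: the class {{OrderedAckSender}}, its methods and the string labels are 
hypothetical, and the number of txns received during SYNC is assumed to be 
known when the gate is created.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Hypothetical reply gate on the follower; not ZooKeeper's code.
class OrderedAckSender {
    private final List<String> sent = new ArrayList<>(); // replies, in order
    private final Queue<Long> heldProposalAcks = new ArrayDeque<>();
    private int unpersistedSyncTxns;
    private boolean newLeaderReceived;
    private boolean ackLdSent;

    OrderedAckSender(int syncTxnsToPersist) {
        this.unpersistedSyncTxns = syncTxnsToPersist;
    }

    /** Called by the thread that handles NEWLEADER. */
    synchronized void onNewLeader() {
        newLeaderReceived = true;
        maybeSendAckLd();
    }

    /** Called by the logging thread once a txn is on disk. */
    synchronized void onTxnPersisted(long zxid) {
        if (unpersistedSyncTxns > 0) {
            unpersistedSyncTxns--;
        }
        if (ackLdSent) {
            sent.add("ACK " + zxid);    // order 1 already satisfied
        } else {
            heldProposalAcks.add(zxid); // hold the ACK until ACK-LD is out
        }
        maybeSendAckLd();
    }

    private void maybeSendAckLd() {
        // Order 2: ACK-LD only after NEWLEADER and all SYNC txns persisted.
        if (ackLdSent || !newLeaderReceived || unpersistedSyncTxns > 0) {
            return;
        }
        sent.add("ACK-LD");
        ackLdSent = true;
        // Order 1: proposal ACKs go out only after ACK-LD.
        while (!heldProposalAcks.isEmpty()) {
            sent.add("ACK " + heldProposalAcks.poll());
        }
    }

    synchronized List<String> sent() {
        return new ArrayList<>(sent);
    }
}
```

With one txn pending from SYNC, the gate emits ACK-LD followed by the ACK of 
that txn regardless of whether {{onNewLeader()}} or {{onTxnPersisted(..)}} is 
called first.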



> Unnecessary system unavailability due to Leader shutdown when follower sent 
> ACK of PROPOSAL before sending ACK of NEWLEADER in log recovery
> -------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-4685
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4685
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: quorum, server
>    Affects Versions: 3.6.3, 3.7.0, 3.8.0, 3.7.1, 3.8.1
>            Reporter: Sirius
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
