[
https://issues.apache.org/jira/browse/ZOOKEEPER-4685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sirius updated ZOOKEEPER-4685:
------------------------------
Description:
When a follower processes the NEWLEADER message in the SYNC phase, its
QuorumPeer thread calls {{logRequest(..)}} to submit the txn persistence
task to the SyncThread. The SyncThread may persist txns and reply with their
ACKs before the QuorumPeer thread replies with ACK-LD (i.e. the ACK of
NEWLEADER) to the leader. As a result, the leader may fail to collect enough
ACK-LDs, shut down, and trigger a new round of election. This introduces
unnecessary recovery work, delays the servers' entry into the BROADCAST
phase, and significantly reduces the service's availability.
The following trace can be reproduced on the latest version.
h2. Trace
Start the ensemble with three nodes: +S0+, +S1+ & {+}S2{+}.
- +S2+ is elected leader.
- +S2+ logs a new txn <1, 1> and broadcasts it.
- +S0+ restarts & +S1+ crashes before receiving the proposal of <1, 1>.
- +S2+ is elected leader again.
- +S2+ syncs with +S0+ using DIFF, and sends the proposal of <1, 1> during
SYNC.
- After +S0+ receives NEWLEADER, {+}S0{+}'s SyncThread may persist the txn
<1, 1> and reply with the corresponding ACK to the leader +S2+ before
{+}S0{+}'s QuorumPeer thread replies with ACK-LD. (This is possible because
txn logging is processed asynchronously by the SyncThread.)
- The corresponding LearnerHandler on +S2+ does not expect the ACK of a
proposal to arrive before ACK-LD, and blocks at _waitForStartup()_ until the
leader turns its state to {_}state.RUNNING{_}.
- However, the QuorumPeer of the leader +S2+ does not receive enough ACK-LDs
before the timeout, and throws _InterruptedException_ during
{_}waitForNewLeaderAck(..){_}.
- The leader then shuts down and a new round of election starts, which takes
extra time to re-establish the quorum and significantly reduces availability.
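The trace above can be illustrated with a small toy model. It is only a sketch: the class, enum and method names below are hypothetical and are not taken from the ZooKeeper code base, and the model assumes the leader counts its own ACK-LD and that a handler which reads an ACK of a proposal first never gets to read the ACK-LD queued behind it before the timeout.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Toy model of the trace; all names here are hypothetical, not ZooKeeper APIs.
final class AckOrderModel {
    enum Packet { ACK_LD, ACK_PROPOSAL }

    // Models one learner handler on the leader. It only counts an ACK-LD if
    // that is the first packet it reads. If an ACK of a proposal arrives
    // first, the handler parks (as in waitForStartup()) and the ACK-LD
    // behind it is not read before the timeout.
    static boolean handlerCountsAckLd(List<Packet> arrivalOrder) {
        return !arrivalOrder.isEmpty() && arrivalOrder.get(0) == Packet.ACK_LD;
    }

    // Assumption: the leader counts itself, then needs a strict majority.
    static boolean leaderReachesQuorum(int ensembleSize, List<List<Packet>> followers) {
        int acks = 1;
        for (List<Packet> follower : followers) {
            if (handlerCountsAckLd(follower)) {
                acks++;
            }
        }
        return acks > ensembleSize / 2;
    }

    public static void main(String[] args) {
        List<Packet> s1Crashed = Collections.emptyList();
        List<Packet> s0Racy = Arrays.asList(Packet.ACK_PROPOSAL, Packet.ACK_LD);
        List<Packet> s0Ordered = Arrays.asList(Packet.ACK_LD, Packet.ACK_PROPOSAL);

        // Order from the trace: the leader times out.
        System.out.println(leaderReachesQuorum(3, Arrays.asList(s0Racy, s1Crashed)));
        // ACK-LD first: the leader collects a quorum.
        System.out.println(leaderReachesQuorum(3, Arrays.asList(s0Ordered, s1Crashed)));
    }
}
```

With +S1+ down, +S0+ is the only follower that can complete the quorum, so a single reordered pair of packets is enough to force the re-election in this model.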
h2. Possible Fix
Considering this issue and ZOOKEEPER-4646, one possible fix is to guarantee
the following partial orders:
* The follower replies ACK of PROPOSAL only after it replies ACK-LD (i.e. ACK
of NEWLEADER) to the leader (so as to avoid this issue).
* The follower replies ACK-LD only after it has persisted the txns that might
be applied to the leader's datatree before the leader gets into the BROADCAST
phase (to avoid ZOOKEEPER-4646).
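The two partial orders above could be enforced by a small follower-side gate, sketched below. This is not a patch and not ZooKeeper's implementation: {{FollowerAckGate}} and all of its methods are hypothetical names, and the model makes the simplifying assumption that ACK-LD waits for every txn queued so far to be persisted.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Hypothetical follower-side gate; names are illustrative, not ZooKeeper APIs.
final class FollowerAckGate {
    private final List<String> sent = new ArrayList<>();     // packets in send order
    private final Queue<Long> heldAcks = new ArrayDeque<>(); // ACKs held until ACK-LD
    private int unpersisted;       // txns handed to the sync thread, not yet durable
    private boolean newLeaderSeen;
    private boolean ackLdSent;

    // QuorumPeer thread: a txn was submitted for persistence.
    synchronized void txnQueued() {
        unpersisted++;
    }

    // QuorumPeer thread: NEWLEADER was received.
    synchronized void onNewLeader() {
        newLeaderSeen = true;
        release();
    }

    // Sync thread: the txn with this zxid is now durable.
    synchronized void onPersisted(long zxid) {
        unpersisted--;
        heldAcks.add(zxid);
        release();
    }

    // Order 2: ACK-LD only once NEWLEADER was seen and nothing is unpersisted.
    // Order 1: ACKs of proposals are released only after ACK-LD.
    private void release() {
        if (newLeaderSeen && !ackLdSent && unpersisted == 0) {
            sent.add("ACK-LD");
            ackLdSent = true;
        }
        if (ackLdSent) {
            while (!heldAcks.isEmpty()) {
                sent.add("ACK:" + Long.toHexString(heldAcks.poll()));
            }
        }
    }

    synchronized List<String> sentPackets() {
        return new ArrayList<>(sent);
    }

    public static void main(String[] args) {
        FollowerAckGate gate = new FollowerAckGate();
        gate.txnQueued();               // proposal <1, 1> received during SYNC
        gate.onPersisted(0x100000001L); // sync thread finishes before NEWLEADER
        gate.onNewLeader();
        System.out.println(gate.sentPackets());
    }
}
```

In this sketch the send order is the same whether the sync thread finishes before or after NEWLEADER is processed, which is the property the two partial orders ask for.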
> Unnecessary system unavailability due to Leader shutdown when follower sent
> ACK of PROPOSAL before sending ACK of NEWLEADER in log recovery
> -------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: ZOOKEEPER-4685
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4685
> Project: ZooKeeper
> Issue Type: Bug
> Components: quorum, server
> Affects Versions: 3.6.3, 3.7.0, 3.8.0, 3.7.1, 3.8.1
> Reporter: Sirius
> Priority: Major