[ https://issues.apache.org/jira/browse/ZOOKEEPER-4785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Damien Diederen updated ZOOKEEPER-4785:
---------------------------------------
    Fix Version/s: 3.9.2
                   3.10

> Txn loss due to race condition in Learner.syncWithLeader() during DIFF sync
> ---------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-4785
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4785
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: server
>    Affects Versions: 3.8.0, 3.7.1, 3.8.1, 3.7.2, 3.8.2, 3.9.1
>            Reporter: Li Wang
>            Assignee: Li Wang
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 3.9.2, 3.10
>
>          Time Spent: 50m
>  Remaining Estimate: 0h
>
> We had a txn loss incident in production recently. After investigation, we
> found it was caused by a race condition in the Learner.syncWithLeader()
> method: the follower writes the current epoch and sends the ACK_LD before it
> has successfully persisted all the txns from the DIFF sync.
> {code:java}
> case Leader.NEWLEADER: 
>         ...
>         self.setCurrentEpoch(newEpoch);
>         writeToTxnLog = true;
>         //Anything after this needs to go to the transaction log, not applied directly in memory
>         isPreZAB1_0 = false;
>         // ZOOKEEPER-3911: make sure sync the uncommitted logs before commit them (ACK NEWLEADER).
>         sock.setSoTimeout(self.tickTime * self.syncLimit);
>         self.setSyncMode(QuorumPeer.SyncMode.NONE);
>         zk.startupWithoutServing();
>         if (zk instanceof FollowerZooKeeperServer) {
>             FollowerZooKeeperServer fzk = (FollowerZooKeeperServer) zk;
>             for (PacketInFlight p : packetsNotCommitted) {
>               fzk.logRequest(p.hdr, p.rec, p.digest);
>             }
>             packetsNotCommitted.clear();
>         }
>         writePacket(new QuorumPacket(Leader.ACK, newLeaderZxid, null, null), true);
>         break;
>     }
> {code}
> In this method, when the follower receives the NEWLEADER msg, the current
> epoch is updated before the uncommitted txns are written to disk, and those
> txns are written asynchronously by the SyncThread. If the follower crashes
> after setting the current epoch and sending ACK_LD but before all
> transactions are successfully written to disk, transaction loss can happen.
> This is because leader election compares the epoch first and only then the
> transaction id. When this follower becomes the leader because it has the
> highest epoch, it asks the other followers to truncate txns even though they
> have been written to disk, causing data loss.
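A minimal sketch of the crash window described above, assuming a toy follower in which a pending queue stands in for the SyncThread and a plain field stands in for the persisted currentEpoch. All names here (ToyFollower, logRequestAsync, syncFlush, onNewLeaderRacy, onNewLeaderFlushFirst) are hypothetical and are not ZooKeeper's API. The flush-first ordering is only one illustrative way to close the window, not necessarily the change that was made for this issue.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical toy follower; not ZooKeeper's Learner or FollowerZooKeeperServer.
class ToyFollower {
    long persistedEpoch;                              // survives a crash, like the currentEpoch file
    final List<Long> durableTxns = new ArrayList<>(); // txns already synced to the txn log
    final List<Long> pendingTxns = new ArrayList<>(); // queued for the async writer, lost on crash

    ToyFollower(long epoch) {
        this.persistedEpoch = epoch;
    }

    // Stand-in for handing a txn to an asynchronous sync thread: it is only queued.
    void logRequestAsync(long zxid) {
        pendingTxns.add(zxid);
    }

    // Stand-in for a blocking flush: every queued txn becomes durable.
    void syncFlush() {
        durableTxns.addAll(pendingTxns);
        pendingTxns.clear();
    }

    // Ordering from the report: the epoch is persisted first, txns are merely queued.
    void onNewLeaderRacy(long newEpoch, List<Long> uncommitted) {
        persistedEpoch = newEpoch;
        for (long zxid : uncommitted) {
            logRequestAsync(zxid);
        }
        // the ACK would go out here, before any txn is durable
    }

    // Illustrative alternative: make the txns durable, then persist the epoch.
    void onNewLeaderFlushFirst(long newEpoch, List<Long> uncommitted) {
        for (long zxid : uncommitted) {
            logRequestAsync(zxid);
        }
        syncFlush();
        persistedEpoch = newEpoch;
        // the ACK would go out here, after the txns are durable
    }

    // A crash discards whatever the async writer had not yet written.
    void crash() {
        pendingTxns.clear();
    }
}
```

Calling onNewLeaderRacy(2, List.of(101L, 102L)) followed by crash() leaves persistedEpoch at 2 with no durable txns, which is the state that lets this follower win the next election with a shorter log. The flush-first variant leaves both txns durable before the epoch moves.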
> The following is the scenario:
> 1. Leader election happened.
> 2. A follower synced with the leader via DIFF, received committed proposals
> from the leader and kept them in memory.
> 3. The follower received the NEWLEADER message.
> 4. The follower updated its current epoch to newEpoch.
> 5. The follower was bounced before writing all the uncommitted txns to disk.
> 6. The leader shut down and a new election was triggered.
> 7. The follower became the new leader because it had the largest
> currentEpoch.
> 8. The new leader asked the other followers to truncate their committed txns,
> and those transactions were lost.
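The scenario can be replayed with a toy model, assuming three surviving peers, zxids written as plain counters rather than epoch/counter pairs, and a simplified election that orders candidates by currentEpoch, then last logged zxid, then server id as a tie-breaker (quorum rules omitted). Peer, Scenario.elect and Scenario.syncFollowers are hypothetical names, and the truncation step is a simplified stand-in for the leader's TRUNC sync.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical toy peer holding only the state the scenario needs.
class Peer {
    final long id;
    final long currentEpoch;
    final List<Long> log = new ArrayList<>(); // durable zxids in ascending order

    Peer(long id, long currentEpoch, List<Long> durableZxids) {
        this.id = id;
        this.currentEpoch = currentEpoch;
        this.log.addAll(durableZxids);
    }

    long lastZxid() {
        return log.isEmpty() ? 0L : log.get(log.size() - 1);
    }
}

class Scenario {
    // Simplified election: highest currentEpoch wins, then highest last zxid,
    // then highest server id as a tie-breaker (quorum checks omitted).
    static Peer elect(List<Peer> peers) {
        return peers.stream()
                .max(Comparator.comparingLong((Peer p) -> p.currentEpoch)
                        .thenComparingLong(Peer::lastZxid)
                        .thenComparingLong(p -> p.id))
                .orElseThrow();
    }

    // Simplified TRUNC: each follower drops every txn past the leader's last zxid.
    static void syncFollowers(Peer leader, List<Peer> peers) {
        long cut = leader.lastZxid();
        for (Peer p : peers) {
            if (p != leader) {
                p.log.removeIf(zxid -> zxid > cut);
            }
        }
    }
}
```

With peer 1 at currentEpoch 2 holding only zxids 1 and 2, and peers 2 and 3 at currentEpoch 1 holding zxids 1 through 4, elect returns peer 1 despite its shorter log, and syncFollowers then cuts peers 2 and 3 back to zxids 1 and 2, dropping the committed txns 3 and 4 (steps 7 and 8 above).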



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
