Brian Nixon created ZOOKEEPER-2872:
--------------------------------------

             Summary: Interrupted snapshot sync causes data loss
                 Key: ZOOKEEPER-2872
                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2872
             Project: ZooKeeper
          Issue Type: Bug
          Components: server
    Affects Versions: 3.5.3, 3.4.10, 3.6.0
            Reporter: Brian Nixon


Observers can permanently lose data from their local data tree while remaining 
members in good standing with the ensemble and continuing to serve client 
traffic. This happens when the following chain of events occurs.

1. The observer dies in epoch N from machine failure.
2. The observer comes back up in epoch N+1 and requests a snapshot sync to 
catch up.
3. The machine powers off before the snapshot is synced to disk and after some 
txns have been logged (depending on the OS, this can happen!).
4. The observer comes back a second time and replays its most recent snapshot 
(epoch <= N) as well as the txn logs (epoch N+1). 
5. A diff sync is requested from the leader and the observer broadcasts 
availability.

In this scenario, any commits from epoch N that the observer did not receive 
before it died the first time will never be exposed to the observer and no part 
of the ensemble will complain. 
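
The scenario above can be modeled with a small toy program. This is only a 
sketch under stated assumptions, not ZooKeeper code: the class LostCommitsDemo 
and its methods are hypothetical, epoch N is taken to be 1, and the diff sync 
is simplified to "the leader sends every commit with a zxid greater than the 
learner's last logged zxid". The zxid layout (epoch in the high 32 bits, 
counter in the low 32 bits) follows ZooKeeper's convention.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

public class LostCommitsDemo {

    // Compose a zxid: epoch in the high 32 bits, counter in the low 32 bits.
    public static long zxid(long epoch, long counter) {
        return (epoch << 32) | counter;
    }

    // Simplified diff sync (an assumption for illustration): the leader sends
    // every commit newer than the learner's last zxid. Returns the leader's
    // commits that the learner still lacks afterwards.
    public static List<Long> missingAfterDiffSync(TreeSet<Long> leader,
                                                  TreeSet<Long> learner) {
        long lastSeen = learner.last();
        TreeSet<Long> synced = new TreeSet<>(learner);
        synced.addAll(leader.tailSet(lastSeen, false));

        List<Long> missing = new ArrayList<>();
        for (long z : leader) {
            if (!synced.contains(z)) {
                missing.add(z);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Leader history: five commits in epoch N (= 1), three in epoch N+1.
        TreeSet<Long> leader = new TreeSet<>();
        for (int c = 1; c <= 5; c++) leader.add(zxid(1, c));
        for (int c = 1; c <= 3; c++) leader.add(zxid(2, c));

        // Learner state after the second restart (step 4): an old snapshot
        // covering epoch N up to counter 3, plus txn logs from epoch N+1.
        // The snapshot received in step 2 never reached disk.
        TreeSet<Long> learner = new TreeSet<>();
        for (int c = 1; c <= 3; c++) learner.add(zxid(1, c));
        for (int c = 1; c <= 2; c++) learner.add(zxid(2, c));

        for (long z : missingAfterDiffSync(leader, learner)) {
            System.out.println("missing epoch=" + (z >> 32)
                    + " counter=" + (z & 0xffffffffL));
        }
    }
}
```

In this model the learner's last zxid already lies in epoch N+1, so the 
simplified diff sync never revisits epoch N and the two trailing commits of 
that epoch stay missing.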

This situation is not unique to observers and can happen to any learner. As a 
simple fix, fsync-ing the snapshots received from the leader prevents a 
snapshot that never reached disk from causing data loss.
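
As a rough illustration of the proposed fix, the following sketch writes a 
snapshot to a temporary file, forces it to stable storage, and only then 
renames it into place. The class DurableSnapshotWriter and the method 
writeDurably are hypothetical names, not ZooKeeper's actual snapshot code; 
only standard java.nio calls are used.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;

public class DurableSnapshotWriter {

    // Hypothetical helper: persist snapshot bytes so that the target file
    // only appears under its final name after its contents have been forced.
    public static void writeDurably(Path target, byte[] snapshotBytes)
            throws IOException {
        Path tmp = target.resolveSibling(target.getFileName() + ".tmp");
        try (FileChannel ch = FileChannel.open(tmp,
                StandardOpenOption.CREATE,
                StandardOpenOption.WRITE,
                StandardOpenOption.TRUNCATE_EXISTING)) {
            ByteBuffer buf = ByteBuffer.wrap(snapshotBytes);
            while (buf.hasRemaining()) {
                ch.write(buf);
            }
            // Equivalent of fsync: flush file data and metadata to the device.
            ch.force(true);
        }
        // Rename into place only after the contents have been forced.
        Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE);
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("snap-demo");
        Path snap = dir.resolve("snapshot.demo");
        writeDurably(snap, "example payload".getBytes(StandardCharsets.UTF_8));
        System.out.println("wrote " + snap);
    }
}
```

This sketch does not fsync the parent directory, so on some filesystems the 
rename itself may still be lost on power failure. The relevant point for this 
issue is that the snapshot contents are forced to disk before any later txns 
are logged.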



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
