[jira] [Created] (ZOOKEEPER-4882) Data loss after restarting an node experienced temporary disk error and rejoin

Kezhu Wang (Jira) Sat, 26 Oct 2024 20:10:13 -0700

Kezhu Wang created ZOOKEEPER-4882:
-------------------------------------

             Summary: Data loss after restarting an node experienced temporary 
disk error and rejoin
                 Key: ZOOKEEPER-4882
                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4882
             Project: ZooKeeper
          Issue Type: Bug
          Components: server
    Affects Versions: 3.9.3, 3.8.4
            Reporter: Kezhu Wang



    The cause has several parts:
    1. The leader commits a proposal once a quorum has acked it.
    2. A proposal can be committed in a node's memory even if it has not
       been written to that node's disk.
    3. After a disk error, the txn log can lag behind the in-memory database.

    The above applies to both the leader and followers. I have not verified
the leader branch, so let's consider only the follower for now.

    f4. A follower that experienced a temporary disk error will have a hole
       in its txn log after it rejoins.
    f5. Once restarted, that follower will lose the data. Worse, it is able
       to win an election and propagate the data loss to the whole cluster.
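
The sequence above can be illustrated with a toy model. This is only a sketch under simplifying assumptions, not ZooKeeper code: `ToyFollower`, `on_proposal`, `on_commit`, `restart` and `simulate` are hypothetical names, a txn is reduced to its zxid, and the commit is assumed to arrive because the other nodes already form a quorum.

```python
class ToyFollower:
    """Hypothetical stand-in for a follower; not ZooKeeper's real classes."""

    def __init__(self):
        self.txn_log = []     # zxids that reached the on-disk txn log
        self.memory_db = []   # zxids applied to the in-memory database
        self.disk_ok = True   # False while the temporary disk error lasts

    def on_proposal(self, zxid):
        """Try to persist a proposal; return True if the log write succeeded."""
        if self.disk_ok:
            self.txn_log.append(zxid)
            return True
        return False

    def on_commit(self, zxid):
        """Apply a committed txn in memory, regardless of the local log write."""
        self.memory_db.append(zxid)

    def restart(self):
        """After a restart, only what was written to disk can be recovered."""
        self.memory_db = list(self.txn_log)

    def last_zxid(self):
        return self.memory_db[-1] if self.memory_db else 0


def simulate(failed_zxids=(3,), total=5):
    """Replay zxids 1..total, with the disk failing for `failed_zxids`."""
    follower = ToyFollower()
    for zxid in range(1, total + 1):
        follower.disk_ok = zxid not in failed_zxids
        follower.on_proposal(zxid)   # the write may fail silently in this model
        follower.on_commit(zxid)     # assumed: quorum was reached elsewhere
    before = list(follower.memory_db)
    follower.restart()
    after = list(follower.memory_db)
    missing = sorted(set(before) - set(after))
    return before, after, missing, follower.last_zxid()


if __name__ == "__main__":
    before, after, missing, last = simulate()
    print("in memory before restart:", before)
    print("recovered after restart: ", after)
    print("hole in txn log:         ", missing)
    print("last zxid after restart: ", last)
```

In this model the restarted follower still reports the newest zxid even though an earlier txn is gone, which is why a hole like this would not necessarily stop the node from winning an election.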

I authored commits in my repo to expose this.

https://github.com/kezhuw/zookeeper/commits/data-loss-temporary-sync-disk-error/



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
