[
https://issues.apache.org/jira/browse/ZOOKEEPER-2745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16695459#comment-16695459
]
Michael K. Edwards commented on ZOOKEEPER-2745:
-----------------------------------------------
Is this still potentially an issue in 3.5.5? Or can it be closed?
> Node loses data after disk-full event, but successfully joins Quorum
> --------------------------------------------------------------------
>
> Key: ZOOKEEPER-2745
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2745
> Project: ZooKeeper
> Issue Type: Bug
> Components: server
> Affects Versions: 3.4.6
> Environment: Ubuntu 12.04
> Reporter: Abhay Bothra
> Priority: Critical
> Attachments: ZOOKEEPER-2745.patch
>
>
> If the disk fills up on 1 zookeeper node in a 3 node ensemble, that node is
> able to join the quorum with partial data.
> Setup:
> --------
> - Running a 3 node zookeeper ensemble on Ubuntu 12.04 as upstart services.
> Let's call the nodes: A, B and C.
> Observation:
> -----------------
> - Connecting to 2 of the 3 nodes (nodes A and B) and doing an `ls /` via the
> zookeeper client was giving:
> /foo
> /bar
> /baz
> But the same `ls /` on node C was giving:
> /baz
> - On node C, the zookeeper data directory had the following files:
> log.1001
> log.1600
> snapshot.1000 -> size 200
> snapshot.1200 -> size 269
> snapshot.1300 -> size 300
> - Snapshot sizes on nodes A and B were in the vicinity of 500 KB
> RCA
> -------
> - Disk was full on node C prior to the creation time of the small snapshot
> files.
> - Looking at the zookeeper server logs, we observed that zookeeper had crashed
> and restarted a few times after the first instance of disk full. Every time
> zookeeper starts, it does 3 things:
> 1. Run the purge task to clean up old snapshots and txn logs. Our
> autopurge.snapRetainCount is set to 3.
> 2. Restore from the most recent valid snapshot and the txn logs that follow.
> 3. Take part in a leader election - realize it has missed something -
> become the follower - get diff of missed txns from the current leader -
> create a new snapshot of its current state.
> - We confirmed that a valid snapshot of the system had existed prior to, and
> immediately after the crash. Let's call this snapshot snapshot.800.
> - Over the next 3 restarts, zookeeper did the following:
> - Purged older snapshots
> - Restored from snapshot.800 + txn logs
> - Synced up with the master and tried to write its updated state to a new
> snapshot, but crashed due to disk full. The snapshot file, even though
> invalid, had already been created.
> - *Note*: This is the first source of the bug. It might be more appropriate
> to first write the snapshot to a temporary file, and then rename it to
> snapshot.<txn_id>. That would give us more confidence in the validity of
> snapshots in the data dir (a sketch of this is included at the end of this
> description).
> - Let's say the snapshot files created above were snapshot.850, snapshot.920
> and snapshot.950
> - On the 4th restart, the purge task retained the 3 recent snapshots -
> snapshot.850, snapshot.920, and snapshot.950, and proceeded to purge
> snapshot.800 and associated txn logs assuming that they were no longer needed.
> - *Note*: This is the second source of the bug. Instead of retaining the 3
> most recent *valid* snapshots, the server just retains the 3 most recent
> snapshots, regardless of their validity (a sketch of a validity-aware purge
> is also included at the end of this description).
> - When restoring, zookeeper doesn't find any valid snapshot to restore
> from, so it tries to reload its state from txn logs starting at zxid 0.
> However, those transaction logs were garbage collected long ago, so it
> reloads from whatever txn logs are present. Let's say the only txn log
> file present (log.951) contains logs for zxid 951 to 998. It reloads from
> that log file, syncs with the master, gets txns 999 and 1000, and writes
> snapshot.1000 to disk. Now that snapshot.800 has been deleted, there is
> enough free disk space to write snapshot.1000. From this state onwards,
> zookeeper will always assume it has the full state up to txn id 1000, even
> though it only has the state from txn id 951 to 1000.
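>
> Sketches of the fixes suggested in the notes above follow. First, the
> write-to-temp-then-rename idea from the first note. This is a minimal sketch
> with hypothetical names, not the attached patch; in the real code the change
> would presumably belong in the snapshot-writing path (FileSnap):
>
>     import java.io.FileOutputStream;
>     import java.io.IOException;
>     import java.nio.file.Files;
>     import java.nio.file.Path;
>     import java.nio.file.StandardCopyOption;
>
>     public class AtomicSnapshotWrite {
>         // Write the serialized snapshot to a temp file, force it to disk, and
>         // only then rename it to snapshot.<zxid>, so a half-written snapshot
>         // never appears under the final name.
>         public static void write(Path snapDir, long zxid, byte[] serialized) throws IOException {
>             Path tmp = snapDir.resolve("snapshot." + Long.toHexString(zxid) + ".tmp");
>             Path fin = snapDir.resolve("snapshot." + Long.toHexString(zxid));
>             try (FileOutputStream fos = new FileOutputStream(tmp.toFile())) {
>                 fos.write(serialized);
>                 fos.getFD().sync();   // make the bytes durable before the rename
>             }
>             // If the disk fills up during the write above, the exception propagates
>             // before this point and no snapshot.<zxid> file is ever created.
>             Files.move(tmp, fin, StandardCopyOption.ATOMIC_MOVE);
>         }
>     }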
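>
> Second, the validity-aware retention from the second note. Again a sketch
> with hypothetical names (I believe the relevant code is the purge task,
> PurgeTxnLog); a real validity check would need to deserialize the snapshot
> rather than just look at its size:
>
>     import java.io.File;
>     import java.util.ArrayList;
>     import java.util.Comparator;
>     import java.util.List;
>
>     public class ValidSnapshotRetention {
>         // Keep the newest retainCount snapshots that pass a validity check,
>         // rather than the newest retainCount snapshots by zxid alone.
>         public static List<File> snapshotsToRetain(List<File> snapshots, int retainCount) {
>             List<File> sorted = new ArrayList<>(snapshots);
>             sorted.sort(Comparator.comparingLong(ValidSnapshotRetention::zxidOf).reversed());
>             List<File> keep = new ArrayList<>();
>             for (File snap : sorted) {
>                 if (keep.size() == retainCount) {
>                     break;
>                 }
>                 if (isValid(snap)) {        // only valid snapshots count toward the quota
>                     keep.add(snap);
>                 }
>             }
>             return keep;                    // everything else is eligible for purging
>         }
>
>         private static long zxidOf(File snap) {
>             return Long.parseLong(snap.getName().substring("snapshot.".length()), 16);
>         }
>
>         // Placeholder check: a real implementation would attempt to deserialize
>         // the snapshot (or verify a checksum) instead of trusting the file size.
>         private static boolean isValid(File snap) {
>             return snap.length() > 0;
>         }
>     }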
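>
> Finally, a possible guard for the restore path described in the last bullet
> (hypothetical, not part of the attached patch): refuse to start when the
> surviving txn logs don't connect to the state that was restored.
>
>     public class RestoreGapCheck {
>         // restoredSnapshotZxid: zxid of the snapshot restored from, or 0 if no
>         //                       valid snapshot was found
>         // firstAvailableTxnZxid: first zxid present in the surviving txn logs
>         public static void assertNoGap(long restoredSnapshotZxid, long firstAvailableTxnZxid) {
>             if (firstAvailableTxnZxid > restoredSnapshotZxid + 1) {
>                 throw new IllegalStateException(
>                     "txn logs start at 0x" + Long.toHexString(firstAvailableTxnZxid)
>                     + " but the restored state only covers up to 0x"
>                     + Long.toHexString(restoredSnapshotZxid)
>                     + "; local data has a gap, refusing to start");
>             }
>         }
>     }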
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)