Re: [PR] HDDS-13003. [Design Doc] Snapshot Defragmentation to reduce storage footprint [ozone]

via GitHub Tue, 12 Aug 2025 11:23:42 -0700


smengcl commented on code in PR #8514:
URL: https://github.com/apache/ozone/pull/8514#discussion_r2270757627



##########
hadoop-hdds/docs/content/feature/SnapshotCompaction.md:
##########
@@ -0,0 +1,87 @@
+# Improving Snapshot Scale:
+
+[HDDS-13003](https://issues.apache.org/jira/browse/HDDS-13003)
+
+# Problem Statement
+
+In Apache Ozone, snapshots currently take a checkpoint of the Active Object 
Store (AOS) RocksDB each time a snapshot is created and track the compaction of 
SST files over time. This model works efficiently when snapshots are 
short-lived, as they merely serve as hard links to the AOS RocksDB. However, 
over time, if an older snapshot persists while significant churn occurs in the 
AOS RocksDB (due to compactions and writes), the snapshot RocksDB may diverge 
significantly from both the AOS RocksDB and other snapshot RocksDB instances. 
This divergence increases storage requirements linearly with the number of 
snapshots.
+
+# Solution Proposal:
+
+The primary inefficiency in the current snapshotting mechanism stems from 
constant RocksDB compactions in AOS, which can cause a key, file, or directory 
entry to appear in multiple SST files. Ideally, each unique key, file, or 
directory entry should reside in only one SST file, eliminating redundant 
storage and mitigating the multiplier effect caused by snapshots. If 
implemented correctly, the total RocksDB size would be proportional to the 
total number of unique keys in the system rather than the number of snapshots.
+
+## Snapshot Compaction:
+
+Currently, automatic RocksDB compactions are disabled for snapshot RocksDB to 
preserve snapshot diff performance, preventing any form of compaction. However, 
snapshots can be compacted if the next snapshot in the chain is a checkpoint of 
the previous snapshot plus a diff stored in a separate SST file. The proposed 
approach involves rewriting snapshots iteratively from the beginning of the 
snapshot chain and restructuring them in a separate directory. P.S This has got 
nothing to do with compacting snapshot’s rocksdb, we are not going to enable 
rocksdb auto compaction on snapshot rocksdb.
+
+1. ### Introducing a last compaction time:
+
+   A new boolean flag (`needsCompaction`) and timestamp (`lastCompactionTime`) 
will be added to snapshot metadata. If absent, `needsCompaction` will default 
to `true`.   
+   A new list of Map\<String, List\<Longs\>\> (`sstFiles`) also needs to be 
added to snapshot info; this would be storing the original list of sst files in 
the uncompacted copy of the snapshot corresponding to 
keyTable/fileTable/DirectoryTable.  
+   Since this is not going to be consistent across all OMs this would have to 
be written to a local yaml file inside the snapshot directory and this can be 
maintained in the SnapshotChainManager in memory on startup. So all updates 
should not go via ratis.
+
+2. ### Snapshot Cache Lock for Read Prevention
+
+   A snapshot lock will be introduced in the snapshot cache to prevent reads 
on a specific snapshot during compaction. This ensures no active reads occur 
while replacing the underlying RocksDB instance.
+
+3. ### Directory Structure Changes
+
+   Snapshots currently reside in the `db.checkpoints` directory. The proposal 
introduces a `db.checkpoints.compacted` directory for compacted snapshots. The 
directory format should be as follows:

Review Comment:
   Actually, snapshots are under `db.snapshots`, not `db.checkpoints`. For example:
   
   ```
   /var/lib/hadoop-ozone/om/data/db.snapshots
   /var/lib/hadoop-ozone/om/data/db.snapshots/diffState
   /var/lib/hadoop-ozone/om/data/db.snapshots/diffState/compaction-sst-backup
   /var/lib/hadoop-ozone/om/data/db.snapshots/diffState/snapDiff
   /var/lib/hadoop-ozone/om/data/db.snapshots/diffState/compaction-log
   /var/lib/hadoop-ozone/om/data/db.snapshots/checkpointState
   ```
   
   Newly created snapshots would be under `./db.snapshots/checkpointState`.
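
   As a rough illustration of the layout above, here is a minimal sketch that derives the two state directories from an OM data directory. The class and method names (`SnapshotPaths`, `checkpointStateDir`, `diffStateDir`) are hypothetical and are not Ozone APIs; only the directory names come from the listing.

   ```java
   import java.nio.file.Path;
   import java.nio.file.Paths;

   // Hypothetical helper, not part of Ozone: it only mirrors the directory listing above.
   public class SnapshotPaths {
       // Directory names copied from the listing in this comment.
       static final String SNAPSHOT_DIR = "db.snapshots";
       static final String CHECKPOINT_STATE_DIR = "checkpointState";
       static final String DIFF_STATE_DIR = "diffState";

       // Returns <omDataDir>/db.snapshots/checkpointState, where snapshot checkpoints live.
       static Path checkpointStateDir(Path omDataDir) {
           return omDataDir.resolve(SNAPSHOT_DIR).resolve(CHECKPOINT_STATE_DIR);
       }

       // Returns <omDataDir>/db.snapshots/diffState, which holds the snapshot diff state.
       static Path diffStateDir(Path omDataDir) {
           return omDataDir.resolve(SNAPSHOT_DIR).resolve(DIFF_STATE_DIR);
       }

       public static void main(String[] args) {
           // Example OM data directory taken from the listing; adjust for a real deployment.
           Path omDataDir = Paths.get("/var/lib/hadoop-ozone/om/data");
           System.out.println(checkpointStateDir(omDataDir));
           System.out.println(diffStateDir(omDataDir));
       }
   }
   ```

   A directory for defragmented snapshots, if the design keeps one, would presumably sit alongside `checkpointState` under `db.snapshots` rather than under `db.checkpoints`.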





