sreejasahithi commented on code in PR #281:
URL: https://github.com/apache/ozone-site/pull/281#discussion_r2720956404
##########
docs/07-system-internals/07-features/08-om-bootstrapping-with-snapshots.md:
##########

@@ -0,0 +1,173 @@
+---
+sidebar_label: OM Bootstrapping with Snapshots
+---
+
+# OM Bootstrapping with Snapshots
+
+## Problem Statement
+
+The current bootstrapping mechanism for OM has inconsistencies when dealing with snapshotted OM RocksDBs. Bootstrapping
+occurs without locking mechanisms, and active transactions may still modify snapshot RocksDBs during the process.
+This can lead to a corrupted RocksDB instance on the follower OM post-bootstrapping. To resolve this, the bootstrapping
+process must operate on a consistent system state.
+
+Jira Ticket: [HDDS-12090](https://issues.apache.org/jira/browse/HDDS-12090)
+
+## Background on Snapshots
+
+### Snapshot Operations
+
+When a snapshot is taken on an Ozone bucket, the following steps occur:
+
+1. A RocksDB checkpoint of the active `om.db` is created.
+2. Deleted entries are removed from the `deletedKeyTable` and `deletedDirTable` in the Active Object Store (AOS) RocksDB.
+This is to prevent the blocks from getting purged without checking for the key's presence in the correct snapshot in the snapshot chain.
+3. A new entry is added to the `snapshotInfoTable` in the AOS RocksDB.
+
+### Current Bootstrap Model
+
+The current model involves the follower OM initiating an HTTP request to the leader OM, which provides a consistent view of its state.
+Before bucket snapshots were introduced, this process relied solely on an AOS RocksDB checkpoint. However, with snapshots, multiple RocksDB
+instances (AOS RocksDB + snapshot RocksDBs) must be handled, complicating the process.
+
+#### Workflow
+
+- **Follower Initiation:**
+  - Sends an exclude list of files already copied in previous batches.
+- **Leader Actions:**
+  - Creates an AOS RocksDB checkpoint.
+  - Performs a directory walk through:
+    - AOS RocksDB checkpoint directory.
+    - Snapshot RocksDB directories.
+    - Backup SST file directory (compaction backup directory).
+  - Identifies unique files to be copied in the next batch.
+  - Transfers files in batches, recreating hardlinks on the follower side as needed.
+
+#### Issues with the Current Model
+
+1. Active transactions during bootstrapping may modify snapshot RocksDBs, leading to inconsistencies.
+2. Partial data copies can occur when double-buffer flushes or other snapshot-related operations are in progress.
+3. Large snapshot data sizes (often in GBs) require multi-batch transfers, increasing the risk of data corruption.
+
+## Proposed Fixes
+
+### Locking the Snapshot Cache
+
+Snapshot Cache is the class which is responsible for maintaining all RocksDB handles corresponding to a snapshot.
+The RocksDB handles are closed by the snapshot cache are closed from time to time if there are no references of the
+RocksDB being used by any of the threads in the system. Hence any operation on a snapshot would go through the snapshot
+cache, increasing the reference count of that snapshot. Implementing a lock for this snapshot cache would prevent any newer
+threads from requesting a snapshot RocksDB handle from the snapshot cache. Thus any operation under this lock will have a
+consistent view of the entire snapshot. The only downside to this is that it would block the double buffer thread,
+hence any operation performed under this lock has to be lightweight so that it doesn't end up running for a long
+period of time. (P.S. Sumit's implementation of the optimized Gatekeeping model, which gets rid of the double buffer in
+OM, would result in only blocking the snapshot operations, which should be fine since these operations are only fired by background threads.)
+
+With the above implementation of a lock, there is a way to get a consistent snapshot of the entire OM. Now let's dive into the various approaches to the overall bootstrap flow.
+
+### Approach 1 (Batching files over multiple tarballs)
+
+This approach builds on the current model by introducing size thresholds to manage locks and data transfers more efficiently.
+
+#### Workflow
+
+1. **Follower Initiation:**
+   - Sends an exclude list of previously copied files (identified by `inodeId`).
+2. **Leader Directory Walk:**
+   - Walks through AOS RocksDB, snapshot RocksDBs, and backup SST directories to identify files to transfer.
+   - Compares against the exclude list to avoid duplicate transfers.
+3. If the total size of files to be copied is more than `ozone.om.ratis.snapshot.lock.max.total.size.threshold`, then the
+files would be directly sent over the stream as a tarball, where the name of each file is the inodeId of the file.
+4. If the total size of files to be copied is less than or equal to `ozone.om.ratis.snapshot.lock.max.total.size.threshold`,
+then the snapshot cache lock is taken after waiting for the snapshot cache to completely get empty (no snapshot RocksDB should be open). Under the lock, the following operations would be performed:
+   - Take the AOS RocksDB checkpoint.
+   - A complete directory walk is done on the AOS checkpoint RocksDB directory + all the snapshot RocksDB directories + the backup SST
+file directory (compaction log directory) to figure out all the files to be copied; any file already present in the exclude list would be excluded.
+   - These files are added to the tarball, where again the name of each file would be the inodeId of the file.
+5. As the files are being iterated, the path of each file and its corresponding inodeId would be tracked. When it is the
+last batch, this map would also be written as a text file in the final tarball to recreate all the hardlinks on the follower node.
+
+#### Drawback
+
+The only drawback with this approach is that we might end up sending more data over the network because some SST files sent
+over the network could have been replaced because of compaction running concurrently on the active object store. But at the
+same time, since the entire bootstrap operation is supposed to finish in the order of a few minutes, the amount of extra data
+would be really minimal: assuming 30,000 keys are written in 2 minutes at around 1 KB each, we would write at most 30 MB of extra data.
+
+### Approach 1.1
+
+This approach builds on approach 1, where along with introducing size thresholds to manage locks, we also rely
+on the number of files changed under the snapshot directory as a threshold.
+
+#### Workflow
+
+1. **Follower Initiation:**
+   - Sends an exclude list of previously copied files (identified by `inodeId`).
+2. **Leader Directory Walk:**
+   - Walks through AOS RocksDB, snapshot RocksDBs, and backup SST directories to identify files to transfer.
+   - Compares against the exclude list to avoid duplicate transfers.
+3. If either the total size to be copied or the total number of files to be copied under the snapshot RocksDB directory is
+more than the corresponding threshold (`ozone.om.ratis.snapshot.max.total.sst.size`), then the files would be directly sent over the stream as
+a tarball, where the name of each file is the inodeId of the file.
+4. If the total number of file size to be copied under the snapshot RocksDB directory is less than equal to `ozone.om.ratis.snapshot.max.total.sst.size`
+then the snapshot cache lock is taken after waiting for the snapshot cache to completely get empty (no snapshot RocksDB should be open).
+Under the lock, the following operations would be performed:
+   - Take the AOS RocksDB checkpoint.
+   - A complete directory walk is done on all the snapshot RocksDB directories to figure out all the files to be copied; any file already present in the exclude list would be excluded.
+   - Hard links of these files are added to a tmp directory on the leader.
+   - Exit the lock.
+   - After the lock, all files under the tmp directory, the AOS RocksDB checkpoint directory and the compaction backup directory have to be written to the tarball. As the files are being iterated, the path of each file and its corresponding inodeId would be tracked. Since this is the last batch, this map would also be written as a text file in the final tarball to recreate all the hardlinks on the follower node.
+
+#### Drawback
+
+The drawbacks for this approach is the same as approach 1, but here we are optimizing on the amount of time lock is held

Review Comment:
```suggestion
The drawbacks for this approach are the same as approach 1, but here we are optimizing on the amount of time lock is held
```
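A quick illustration of the hardlink-staging idea from step 4 of the quoted approach 1.1: only cheap `link()` metadata operations happen while the snapshot cache lock is held, and the expensive tarball write happens afterwards. This is a minimal sketch under assumptions — `SnapshotLinkStager`, `snapshotCacheLock`, and the `unix:ino`-based naming are illustrative, not actual Ozone code:

```java
// Hypothetical sketch, not Ozone code: stage snapshot SST files as hardlinks
// in a tmp directory while the snapshot cache lock is held, so the tarball
// can be streamed after the lock is released.
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.locks.ReentrantLock;

class SnapshotLinkStager {
  // Stand-in for the proposed snapshot cache lock.
  private final ReentrantLock snapshotCacheLock = new ReentrantLock();

  List<Path> stageUnderLock(List<Path> snapshotDirs, Path tmpDir,
      Set<Long> excludedInodes) throws IOException {
    List<Path> staged = new ArrayList<>();
    snapshotCacheLock.lock(); // from here on, only cheap metadata work
    try {
      for (Path dir : snapshotDirs) {
        try (DirectoryStream<Path> files = Files.newDirectoryStream(dir)) {
          for (Path file : files) {
            long inodeId = (Long) Files.getAttribute(file, "unix:ino");
            // Name the link after the inodeId, matching the tarball naming scheme.
            Path link = tmpDir.resolve(Long.toString(inodeId));
            if (excludedInodes.contains(inodeId) || Files.exists(link)) {
              continue; // already on the follower, or already staged
            }
            Files.createLink(link, file); // hardlink: no data is copied here
            staged.add(link);
          }
        }
      }
    } finally {
      snapshotCacheLock.unlock(); // tarball streaming happens after this point
    }
    return staged;
  }
}
```

Because only links are created under the lock, the hold time scales with the number of changed files rather than with the gigabytes of SST data being shipped, which is exactly the optimization the review comment above refers to.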
##########
docs/07-system-internals/07-features/08-om-bootstrapping-with-snapshots.md:
##########

@@ -0,0 +1,173 @@
+---
+sidebar_label: OM Bootstrapping with Snapshots
+---
+
+# OM Bootstrapping with Snapshots
+
+## Problem Statement
+
+The current bootstrapping mechanism for OM has inconsistencies when dealing with snapshotted OM RocksDBs. Bootstrapping
+occurs without locking mechanisms, and active transactions may still modify snapshot RocksDBs during the process.
+This can lead to a corrupted RocksDB instance on the follower OM post-bootstrapping. To resolve this, the bootstrapping
+process must operate on a consistent system state.
+
+Jira Ticket: [HDDS-12090](https://issues.apache.org/jira/browse/HDDS-12090)
+
+## Background on Snapshots
+
+### Snapshot Operations
+
+When a snapshot is taken on an Ozone bucket, the following steps occur:
+
+1. A RocksDB checkpoint of the active `om.db` is created.
+2. Deleted entries are removed from the `deletedKeyTable` and `deletedDirTable` in the Active Object Store (AOS) RocksDB.
+This is to prevent the blocks from getting purged without checking for the key's presence in the correct snapshot in the snapshot chain.
+3. A new entry is added to the `snapshotInfoTable` in the AOS RocksDB.
+
+### Current Bootstrap Model
+
+The current model involves the follower OM initiating an HTTP request to the leader OM, which provides a consistent view of its state.
+Before bucket snapshots were introduced, this process relied solely on an AOS RocksDB checkpoint. However, with snapshots, multiple RocksDB
+instances (AOS RocksDB + snapshot RocksDBs) must be handled, complicating the process.
+
+#### Workflow
+
+- **Follower Initiation:**
+  - Sends an exclude list of files already copied in previous batches.
+- **Leader Actions:**
+  - Creates an AOS RocksDB checkpoint.
+  - Performs a directory walk through:
+    - AOS RocksDB checkpoint directory.
+    - Snapshot RocksDB directories.
+    - Backup SST file directory (compaction backup directory).
+  - Identifies unique files to be copied in the next batch.
+  - Transfers files in batches, recreating hardlinks on the follower side as needed.
+
+#### Issues with the Current Model
+
+1. Active transactions during bootstrapping may modify snapshot RocksDBs, leading to inconsistencies.
+2. Partial data copies can occur when double-buffer flushes or other snapshot-related operations are in progress.
+3. Large snapshot data sizes (often in GBs) require multi-batch transfers, increasing the risk of data corruption.
+
+## Proposed Fixes
+
+### Locking the Snapshot Cache
+
+Snapshot Cache is the class which is responsible for maintaining all RocksDB handles corresponding to a snapshot.
+The RocksDB handles are closed by the snapshot cache are closed from time to time if there are no references of the

Review Comment:
```suggestion
The RocksDB handles are closed by the snapshot cache from time to time if there are no references of the
```
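To make the reference counting in the quoted paragraph concrete, here is a small illustrative sketch; the class and method names are assumptions, not the real `SnapshotCache` API, and concurrency corner cases are elided:

```java
// Illustrative sketch of the reference counting described above: a handle is
// handed out with its count incremented, and periodic cleanup may only drop
// handles that no thread references any more.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

class RefCountedSnapshotCache<H> {
  private static final class Entry<T> {
    final T handle;
    final AtomicInteger refCount = new AtomicInteger();
    Entry(T handle) { this.handle = handle; }
  }

  private final Map<String, Entry<H>> cache = new ConcurrentHashMap<>();
  private volatile boolean lockedForBootstrap; // proposed bootstrap gate

  // Every snapshot operation goes through here, bumping the ref count.
  H acquire(String snapshotId, Supplier<H> opener) {
    if (lockedForBootstrap) {
      throw new IllegalStateException("Snapshot cache is locked for bootstrap");
    }
    Entry<H> e = cache.computeIfAbsent(snapshotId, id -> new Entry<>(opener.get()));
    e.refCount.incrementAndGet();
    return e.handle;
  }

  void release(String snapshotId) {
    Entry<H> e = cache.get(snapshotId);
    if (e != null) {
      e.refCount.decrementAndGet();
    }
  }

  // Periodic cleanup: drop only entries with no live references.
  // (Real code would also close the underlying RocksDB handle here.)
  void closeIdleHandles() {
    cache.entrySet().removeIf(en -> en.getValue().refCount.get() == 0);
  }

  // Bootstrap flips this gate, then waits for the cache to drain to empty.
  void lockForBootstrap() {
    lockedForBootstrap = true;
  }

  boolean isEmpty() {
    return cache.isEmpty();
  }
}
```

The bootstrap lock in this model is simply a gate in `acquire`: once the cache drains to zero references and `closeIdleHandles` empties it, the leader has a quiesced, consistent view of every snapshot RocksDB.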
##########
docs/07-system-internals/07-features/08-om-bootstrapping-with-snapshots.md:
##########

@@ -0,0 +1,173 @@
+---
+sidebar_label: OM Bootstrapping with Snapshots
+---
+
+# OM Bootstrapping with Snapshots
+
+## Problem Statement
+
+The current bootstrapping mechanism for OM has inconsistencies when dealing with snapshotted OM RocksDBs. Bootstrapping
+occurs without locking mechanisms, and active transactions may still modify snapshot RocksDBs during the process.
+This can lead to a corrupted RocksDB instance on the follower OM post-bootstrapping. To resolve this, the bootstrapping
+process must operate on a consistent system state.
+
+Jira Ticket: [HDDS-12090](https://issues.apache.org/jira/browse/HDDS-12090)
+
+## Background on Snapshots
+
+### Snapshot Operations
+
+When a snapshot is taken on an Ozone bucket, the following steps occur:
+
+1. A RocksDB checkpoint of the active `om.db` is created.
+2. Deleted entries are removed from the `deletedKeyTable` and `deletedDirTable` in the Active Object Store (AOS) RocksDB.
+This is to prevent the blocks from getting purged without checking for the key's presence in the correct snapshot in the snapshot chain.
+3. A new entry is added to the `snapshotInfoTable` in the AOS RocksDB.
+
+### Current Bootstrap Model
+
+The current model involves the follower OM initiating an HTTP request to the leader OM, which provides a consistent view of its state.
+Before bucket snapshots were introduced, this process relied solely on an AOS RocksDB checkpoint. However, with snapshots, multiple RocksDB
+instances (AOS RocksDB + snapshot RocksDBs) must be handled, complicating the process.
+
+#### Workflow
+
+- **Follower Initiation:**
+  - Sends an exclude list of files already copied in previous batches.
+- **Leader Actions:**
+  - Creates an AOS RocksDB checkpoint.
+  - Performs a directory walk through:
+    - AOS RocksDB checkpoint directory.
+    - Snapshot RocksDB directories.
+    - Backup SST file directory (compaction backup directory).
+  - Identifies unique files to be copied in the next batch.
+  - Transfers files in batches, recreating hardlinks on the follower side as needed.
+
+#### Issues with the Current Model
+
+1. Active transactions during bootstrapping may modify snapshot RocksDBs, leading to inconsistencies.
+2. Partial data copies can occur when double-buffer flushes or other snapshot-related operations are in progress.
+3. Large snapshot data sizes (often in GBs) require multi-batch transfers, increasing the risk of data corruption.
+
+## Proposed Fixes
+
+### Locking the Snapshot Cache
+
+Snapshot Cache is the class which is responsible for maintaining all RocksDB handles corresponding to a snapshot.
+The RocksDB handles are closed by the snapshot cache are closed from time to time if there are no references of the
+RocksDB being used by any of the threads in the system. Hence any operation on a snapshot would go through the snapshot
+cache, increasing the reference count of that snapshot. Implementing a lock for this snapshot cache would prevent any newer
+threads from requesting a snapshot RocksDB handle from the snapshot cache. Thus any operation under this lock will have a
+consistent view of the entire snapshot. The only downside to this is that it would block the double buffer thread,
+hence any operation performed under this lock has to be lightweight so that it doesn't end up running for a long
+period of time. (P.S. Sumit's implementation of the optimized Gatekeeping model, which gets rid of the double buffer in
+OM, would result in only blocking the snapshot operations, which should be fine since these operations are only fired by background threads.)
+
+With the above implementation of a lock, there is a way to get a consistent snapshot of the entire OM. Now let's dive into the various approaches to the overall bootstrap flow.
+
+### Approach 1 (Batching files over multiple tarballs)
+
+This approach builds on the current model by introducing size thresholds to manage locks and data transfers more efficiently.
+
+#### Workflow
+
+1. **Follower Initiation:**
+   - Sends an exclude list of previously copied files (identified by `inodeId`).
+2. **Leader Directory Walk:**
+   - Walks through AOS RocksDB, snapshot RocksDBs, and backup SST directories to identify files to transfer.
+   - Compares against the exclude list to avoid duplicate transfers.
+3. If the total size of files to be copied is more than `ozone.om.ratis.snapshot.lock.max.total.size.threshold`, then the
+files would be directly sent over the stream as a tarball, where the name of each file is the inodeId of the file.
+4. If the total size of files to be copied is less than or equal to `ozone.om.ratis.snapshot.lock.max.total.size.threshold`,
+then the snapshot cache lock is taken after waiting for the snapshot cache to completely get empty (no snapshot RocksDB should be open). Under the lock, the following operations would be performed:
+   - Take the AOS RocksDB checkpoint.
+   - A complete directory walk is done on the AOS checkpoint RocksDB directory + all the snapshot RocksDB directories + the backup SST
+file directory (compaction log directory) to figure out all the files to be copied; any file already present in the exclude list would be excluded.
+   - These files are added to the tarball, where again the name of each file would be the inodeId of the file.
+5. As the files are being iterated, the path of each file and its corresponding inodeId would be tracked. When it is the
+last batch, this map would also be written as a text file in the final tarball to recreate all the hardlinks on the follower node.
+
+#### Drawback
+
+The only drawback with this approach is that we might end up sending more data over the network because some SST files sent
+over the network could have been replaced because of compaction running concurrently on the active object store. But at the
+same time, since the entire bootstrap operation is supposed to finish in the order of a few minutes, the amount of extra data
+would be really minimal: assuming 30,000 keys are written in 2 minutes at around 1 KB each, we would write at most 30 MB of extra data.
+
+### Approach 1.1
+
+This approach builds on approach 1, where along with introducing size thresholds to manage locks, we also rely
+on the number of files changed under the snapshot directory as a threshold.
+
+#### Workflow
+
+1. **Follower Initiation:**
+   - Sends an exclude list of previously copied files (identified by `inodeId`).
+2. **Leader Directory Walk:**
+   - Walks through AOS RocksDB, snapshot RocksDBs, and backup SST directories to identify files to transfer.
+   - Compares against the exclude list to avoid duplicate transfers.
+3. If either the total size to be copied or the total number of files to be copied under the snapshot RocksDB directory is
+more than the corresponding threshold (`ozone.om.ratis.snapshot.max.total.sst.size`), then the files would be directly sent over the stream as
+a tarball, where the name of each file is the inodeId of the file.
+4. If the total number of file size to be copied under the snapshot RocksDB directory is less than equal to `ozone.om.ratis.snapshot.max.total.sst.size`

Review Comment:
```suggestion
4. If the total file size to be copied under the snapshot RocksDB directory is less than or equal to `ozone.om.ratis.snapshot.max.total.sst.size`
```
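As a closing illustration of the threshold check this suggestion touches, here is a sketch of how the leader might decide between another unlocked streamed batch and the locked final batch. Only the config key string comes from the document; the helper names and walk logic are assumptions:

```java
// Sketch of the size-threshold decision: sum the bytes of files not yet sent
// to the follower, and take the snapshot cache lock only once the remainder
// fits under the configured cap.
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Set;
import java.util.concurrent.atomic.AtomicLong;
import java.util.stream.Stream;

final class BootstrapBatchPlanner {
  // Config key quoted from the design doc above.
  static final String THRESHOLD_KEY = "ozone.om.ratis.snapshot.max.total.sst.size";

  // Total bytes of files under the snapshot tree not yet sent to the follower.
  static long remainingBytes(Path snapshotRoot, Set<Long> excludedInodes)
      throws IOException {
    AtomicLong total = new AtomicLong();
    try (Stream<Path> paths = Files.walk(snapshotRoot)) {
      paths.filter(Files::isRegularFile).forEach(p -> {
        try {
          long inodeId = (Long) Files.getAttribute(p, "unix:ino");
          if (!excludedInodes.contains(inodeId)) {
            total.addAndGet(Files.size(p)); // still owed to the follower
          }
        } catch (IOException e) {
          throw new UncheckedIOException(e);
        }
      });
    }
    return total.get();
  }

  // Lock and send the small final batch only once the remainder fits the cap;
  // otherwise stream another unlocked tarball and let the follower re-request.
  static boolean shouldTakeLockForFinalBatch(long remainingBytes, long thresholdBytes) {
    return remainingBytes <= thresholdBytes;
  }
}
```

Whether the cap counts bytes, files, or both (as approach 1.1 proposes) only changes what `remainingBytes` tallies; the lock-or-stream control flow stays the same.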
