This is an automated email from the ASF dual-hosted git repository.

weichiu pushed a commit to branch HDDS-9225-website-v2
in repository https://gitbox.apache.org/repos/asf/ozone-site.git


The following commit(s) were added to refs/heads/HDDS-9225-website-v2 by this 
push:
     new 6f704aed7 HDDS-14395. [Website v2] [Docs] [Administrator Guide] 
RocksDB in Apache Ozone (#239)
6f704aed7 is described below

commit 6f704aed7a706e794051bb61db5ed64e9432a2ea
Author: Gargi Jaiswal <[email protected]>
AuthorDate: Wed Jan 14 23:48:37 2026 +0530

    HDDS-14395. [Website v2] [Docs] [Administrator Guide] RocksDB in Apache 
Ozone (#239)
---
 cspell.yaml                                        |   1 +
 .../02-configuration/04-performance/04-rocksdb.md  | 196 +++++++++++++++++++++
 2 files changed, 197 insertions(+)

diff --git a/cspell.yaml b/cspell.yaml
index d7f878876..4f7ba07d0 100644
--- a/cspell.yaml
+++ b/cspell.yaml
@@ -121,6 +121,7 @@ words:
 - Keytab
 - RocksDB
 - LDB
+- memtable
 - hsync
 - SASL
 - jira
diff --git 
a/docs/05-administrator-guide/02-configuration/04-performance/04-rocksdb.md 
b/docs/05-administrator-guide/02-configuration/04-performance/04-rocksdb.md
new file mode 100644
index 000000000..c259a69d7
--- /dev/null
+++ b/docs/05-administrator-guide/02-configuration/04-performance/04-rocksdb.md
@@ -0,0 +1,196 @@
+---
+sidebar_label: RocksDB In Apache Ozone
+---
+
+# RocksDB in Apache Ozone
+
+:::note
+This page covers advanced topics. Ozone administration normally does **not** 
require changing these settings.
+:::
+
+RocksDB is a critical component of Apache Ozone, providing a high-performance 
embedded key-value store. It is used by various Ozone services to persist 
metadata and state.
+
+## 1. Introduction to RocksDB
+
+RocksDB is a log-structured merge-tree (LSM-tree) based key-value store 
created by Facebook. It is optimized for fast storage environments like SSDs 
and offers high write throughput and efficient point lookups.  
+See the [RocksDB GitHub project](https://github.com/facebook/rocksdb) and the 
[RocksDB Wiki](https://github.com/facebook/rocksdb/wiki) for more details.
+
+## 2. How Ozone uses RocksDB
+
+RocksDB is utilized in the following Ozone components to store critical 
metadata:
+
+**Ozone Manager (OM):** The OM uses RocksDB as its primary metadata store, 
holding the entire namespace and related information. As defined in 
`OMDBDefinition.java`, this includes tables for:
+
+- **Namespace:**
+  - **Object Store Layout:** `volumeTable`, `bucketTable`, `keyTable`, `openKeyTable` (for tracking keys that are open for writing, including multipart uploads), and `multipartInfoTable` (for storing multipart upload information).
+  - **File System Layout:** `directoryTable` and `fileTable`, `openFileTable` (for tracking open files), and `deletedDirectoryTable` (for tracking deleted directories).
+  - **Access Control:** `prefixTable` (for storing prefix-based access control 
information).
+- **Security:** `userTable`, `dTokenTable` (delegation tokens), and 
`s3SecretTable`.
+- **S3 Multi-Tenancy:** `tenantStateTable` (for storing tenant state 
information), `tenantAccessIdTable` (for storing access ID information), and 
`principalToAccessIdsTable` (for mapping user principals to access IDs).
+- **State Management:**
+  - `transactionInfoTable` for tracking transactions.
+  - `metaTable` for storing miscellaneous metadata key-value pairs.
+  - `deletedTable` for pending key deletions.
+- **Snapshots:** `snapshotInfoTable` for managing Ozone snapshots, 
`snapshotRenamedTable` (for tracking renamed objects between snapshots), and 
`compactionLogTable` (for storing compaction log entries).
+
+**Storage Container Manager (SCM):** The SCM persists the state of the storage 
layer in RocksDB. The structure, defined in `SCMDBDefinition.java`, includes 
tables for:
+
+- `pipelines`: Manages the state and composition of data pipelines.
+- `containers`: Stores information about all storage containers in the cluster.
+- `deletedBlocks`: Tracks blocks that are marked for deletion and awaiting 
garbage collection.
+- `move`: Coordinates container movements for data rebalancing.
+- `validCerts`: Stores certificates for validating Datanodes.
+- `validSCMCerts`: Stores certificates for validating SCMs.
+- `scmTransactionInfos`: Tracks SCM transactions.
+- `sequenceId`: Manages sequence IDs for various SCM operations.
+- `meta`: Stores miscellaneous SCM metadata, including upgrade finalization 
status and metadata layout version.
+- `statefulServiceConfig`: Stores configurations for stateful services.
+
+**Datanode:** A Datanode utilizes RocksDB for two main purposes:
+
+- **Per-Volume Metadata:** It maintains one RocksDB instance per storage 
volume. Each of these instances manages metadata for the containers and blocks 
stored on that specific volume. As specified in 
`DatanodeSchemaThreeDBDefinition.java`, this database is structured with column 
families for `block_data`, `metadata`, `delete_txns`, `finalize_blocks`, and 
`last_chunk_info`. To optimize performance, it uses a fixed-length prefix based 
on the container ID, enabling efficient lookups with Ro [...]
+- **Global Container Tracking:** Additionally, each Datanode has a single, 
separate RocksDB instance to record the set of all containers it manages. This 
database, defined in `WitnessedContainerDBDefinition.java`, contains a 
`ContainerCreateInfoTable` table that provides a complete index of the 
containers hosted on that Datanode.
+
+**Recon:** Ozone's administration and monitoring tool, Recon, maintains its 
own RocksDB database to store aggregated and historical data for analysis. The 
`ReconDBDefinition.java` outlines tables for:
+
+- `containerKeyTable`: Maps containers to the keys they contain.
+- `namespaceSummaryTable`: Stores aggregated namespace information for quick 
reporting.
+- `replica_history`: Tracks the historical locations of container replicas, 
which is essential for auditing and diagnostics.
+- `keyContainerTable`: Maps keys to the containers they are in.
+- `containerKeyCountTable`: Stores the number of keys in each container.
+- `replica_history_v2`: Tracks the historical locations of container replicas 
with BCSID, which is essential for auditing and diagnostics.
+- `fileCountBySizeTable`: Stores file count statistics grouped by size ranges.
+- `globalStatsTable`: Stores global statistics for the Recon service.
+
+## 3. Tunings applicable to RocksDB
+
+Effective tuning of RocksDB can significantly impact Ozone's performance. 
Ozone exposes several configuration properties to tune RocksDB behavior. These 
properties are typically found in `ozone-default.xml` and can be overridden in 
`ozone-site.xml`.
+
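+These override files use the standard Hadoop-style XML configuration format. As a minimal sketch, switching the RocksDB profile to `SSD` (a value described in the table below) could look like this in `ozone-site.xml`:
+
+```xml
+<!-- Illustrative ozone-site.xml override: apply the SSD-optimized RocksDB profile. -->
+<configuration>
+  <property>
+    <name>hdds.db.profile</name>
+    <value>SSD</value>
+  </property>
+</configuration>
+```
+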
+### General Settings
+
+Ozone provides a set of general RocksDB configurations that apply to all 
services (OM, SCM, and Datanodes) unless overridden by more specific settings. 
With the exception of `hdds.db.profile` and 
`ozone.metastore.rocksdb.cf.write.buffer.size`, these properties are defined in 
`RocksDBConfiguration.java`.
+
+| Property | Default | Description |
+|----------|---------|-------------|
+| `hdds.db.profile` | `DISK` | Specifies the RocksDB profile to use, which 
determines the default DBOptions and ColumnFamilyOptions. Possible values 
include `SSD` and `DISK`. For example, setting this to `SSD` will apply tunings 
optimized for SSD storage. |
+| `ozone.metastore.rocksdb.statistics` | `OFF` | The statistics level of the 
RocksDB store. If set to any value from `org.rocksdb.StatsLevel` (e.g., ALL or 
EXCEPT_DETAILED_TIMERS), RocksDB statistics will be exposed over JMX. Set to 
OFF to disable statistics collection. Note: collecting statistics can have a 
5–10% performance penalty. |
+
+**Write Options:**
+
+| Property | Default | Description |
+|----------|---------|-------------|
+| `hadoop.hdds.db.rocksdb.writeoption.sync` | `false` | If set to `true`, 
writes are synchronized to persistent storage, ensuring durability at the cost 
of performance. If `false`, writes are flushed asynchronously. |
+| `ozone.metastore.rocksdb.cf.write.buffer.size` | `128MB` | The write buffer 
(memtable) size for each column family of the RocksDB store. |
+
+**Write-Ahead Log (WAL) Management:**
+
+| Property | Default | Description |
+|----------|---------|-------------|
+| `hadoop.hdds.db.rocksdb.WAL_ttl_seconds` | `1200` | The time-to-live for WAL 
files in seconds. |
+| `hadoop.hdds.db.rocksdb.WAL_size_limit_MB` | `0` | The total size limit for 
WAL files in megabytes. When this limit is exceeded, the oldest WAL files are 
deleted. A value of `0` means no limit. |
+
+**Logging:**
+
+| Property | Default | Description |
+|----------|---------|-------------|
+| `hadoop.hdds.db.rocksdb.logging.enabled` | `false` | Enables or disables 
RocksDB's own logging. |
+| `hadoop.hdds.db.rocksdb.logging.level` | `INFO` | The logging level for 
RocksDB (INFO, DEBUG, WARN, ERROR, FATAL). |
+| `hadoop.hdds.db.rocksdb.max.log.file.size` | `100MB` | The maximum size of a 
single RocksDB log file. |
+| `hadoop.hdds.db.rocksdb.keep.log.file.num` | `10` | The maximum number of 
RocksDB log files to retain. |
+
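+As an illustration only, the sketch below turns on RocksDB's own log output at `DEBUG` level using the properties above; the chosen level is an example for troubleshooting sessions, not a recommendation for normal operation:
+
+```xml
+<!-- Illustrative ozone-site.xml snippet: enable RocksDB's internal logging for debugging. -->
+<configuration>
+  <property>
+    <name>hadoop.hdds.db.rocksdb.logging.enabled</name>
+    <value>true</value>
+  </property>
+  <property>
+    <name>hadoop.hdds.db.rocksdb.logging.level</name>
+    <value>DEBUG</value>
+  </property>
+</configuration>
+```
+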
+### Ozone Manager (OM) Specific Settings
+
+These settings, defined in `ozone-default.xml`, apply specifically to the 
Ozone Manager.
+
+| Property | Default | Description |
+|----------|---------|-------------|
+| `ozone.om.db.max.open.files` | `-1` (unlimited) | The total number of files that a RocksDB can open in the OM. |
+| `ozone.om.compaction.service.enabled` | `false` | Enable or disable a background job that periodically compacts RocksDB tables flagged for compaction. |
+| `ozone.om.compaction.service.run.interval` | `6h` | The interval for the OM's compaction service. |
+| `ozone.om.compaction.service.timeout` | `10m` | Timeout for the OM's compaction service. |
+| `ozone.om.compaction.service.columnfamilies` | `keyTable`<br />`fileTable`<br />`directoryTable`<br />`deletedTable`<br />`deletedDirectoryTable`<br />`multipartInfoTable` | A comma-separated list of column families to be compacted by the service. |
+
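+For example, a hedged sketch of enabling the OM compaction service might look like the following; the property names come from the table above, while the `4h` interval is an illustrative value, not a recommendation:
+
+```xml
+<!-- Illustrative ozone-site.xml snippet: enable the OM's background compaction service. -->
+<configuration>
+  <property>
+    <name>ozone.om.compaction.service.enabled</name>
+    <value>true</value>
+  </property>
+  <property>
+    <!-- Example interval; the default is 6h. -->
+    <name>ozone.om.compaction.service.run.interval</name>
+    <value>4h</value>
+  </property>
+</configuration>
+```
+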
+### Datanode-Specific Settings
+
+These settings, defined in `DatanodeConfiguration.java`, apply specifically to 
Datanodes and will override the general settings where applicable.
+
+Key tuning parameters for the Datanode often involve:
+
+**Memory usage:** Configuring block cache, write buffer manager, and other 
memory-related settings.
+
+| Property | Default | Description |
+|----------|---------|-------------|
+| `hdds.datanode.metadata.rocksdb.cache.size` | `1GB` | Configures the block 
cache size for RocksDB instances on Datanodes. |
+
+**Compaction strategies:** Optimizing how data is merged and organized on 
disk. For more details, refer to the [Datanode Container Schema v3 in DN 
Documentation](../../../system-internals/components/datanode/rocksdb-schema/).
+
+| Property | Default | Description |
+|----------|---------|-------------|
+| `hdds.datanode.rocksdb.auto-compaction-small-sst-file` | `true` | Enables or 
disables auto-compaction for small SST files. |
+| `hdds.datanode.rocksdb.auto-compaction-small-sst-file-size-threshold` | `1MB` | Size below which an SST file is considered small enough for auto-compaction. |
+| `hdds.datanode.rocksdb.auto-compaction-small-sst-file-num-threshold` | `512` | Number of small SST files that triggers auto-compaction. |
+| `hdds.datanode.rocksdb.auto-compaction-small-sst-file.interval.minutes` | `120` | Interval, in minutes, between auto-compaction runs for small SST files. |
+| `hdds.datanode.rocksdb.auto-compaction-small-sst-file.threads` | `1` | Number of threads used to auto-compact small SST files. |
+
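+As a sketch, a Datanode hosting many small SST files might combine a larger block cache with a lower auto-compaction trigger; the values below are illustrative assumptions, not tested recommendations:
+
+```xml
+<!-- Illustrative ozone-site.xml snippet for a Datanode (example values only). -->
+<configuration>
+  <property>
+    <!-- Default is 1GB; a larger cache assumes spare memory on the Datanode. -->
+    <name>hdds.datanode.metadata.rocksdb.cache.size</name>
+    <value>2GB</value>
+  </property>
+  <property>
+    <!-- Default is 512; a lower threshold compacts small SST files sooner. -->
+    <name>hdds.datanode.rocksdb.auto-compaction-small-sst-file-num-threshold</name>
+    <value>256</value>
+  </property>
+</configuration>
+```
+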
+**Write-ahead log (WAL) settings:** Balancing durability and write performance.
+
+| Property | Default | Description |
+|----------|---------|-------------|
+| `hdds.datanode.rocksdb.log.max-file-size` | `32MB` | The maximum size of each RocksDB user log file. 0 means no size limit. |
+| `hdds.datanode.rocksdb.log.max-file-num` | `64` | The maximum number of user log files to keep for each RocksDB instance. |
+
+**Logging:**
+
+| Property | Default | Description |
+|----------|---------|-------------|
+| `hdds.datanode.rocksdb.log.level` | `INFO` | The user log level of RocksDB (DEBUG, INFO, WARN, ERROR, or FATAL). |
+
+**Other Settings:**
+
+| Property | Default | Description |
+|----------|--------|------------|
+| `hdds.datanode.db.config.path` | empty (not configured) | Path to an INI 
configuration file for advanced RocksDB tuning on Datanodes. |
+| `hdds.datanode.container.schema.v3.enabled` | `true` | Enable container 
schema v3 (one RocksDB per disk). |
+| `hdds.datanode.container.schema.v3.key.separator` | &#124; | The separator 
between Container ID and container meta key name in schema v3. |
+| `hdds.datanode.rocksdb.delete-obsolete-files-period` | `1h` | How often RocksDB's obsolete files are deleted. |
+| `hdds.datanode.rocksdb.max-open-files` | `1024` | The total number of files 
that a RocksDB can open. |
+
+## 4. Troubleshooting and repair tools relevant to RocksDB
+
+Troubleshooting RocksDB issues in Ozone often involves:
+
+- Analyzing RocksDB logs for errors and warnings.
+- Using RocksDB's built-in tools for inspecting database files:
+  - [ldb](https://github.com/facebook/rocksdb/wiki/Administration-and-Data-Access-Tool#ldb-tool): A command-line tool for inspecting and manipulating the contents of a RocksDB database.
+  - [sst_dump](https://github.com/facebook/rocksdb/wiki/Administration-and-Data-Access-Tool#sst-dump-tool): A command-line tool for inspecting the contents of SST files, which are the files that store the data in RocksDB.
+- Understanding common RocksDB error codes and their implications.
+
+## 5. Version compatibility
+
+Ozone 2.2.0 uses RocksDB **7.7.3**. It is recommended to use RocksDB tools built from the same version to ensure compatibility and avoid potential issues.
+
+## 6. Monitoring and Metrics
+
+Monitoring RocksDB performance is crucial for maintaining a healthy Ozone 
cluster.
+
+- **RocksDB Statistics:** Ozone can expose detailed RocksDB statistics. Enable this by setting `ozone.metastore.rocksdb.statistics` to `ALL` or `EXCEPT_DETAILED_TIMERS` in `ozone-site.xml`. Be aware that enabling detailed statistics can incur a performance penalty (5-10%); see the example after this list.
+- **Grafana Dashboards:** Ozone provides Grafana dashboards that visualize 
low-level RocksDB statistics. Refer to the [Ozone Monitoring 
Documentation](../../operations/observability/) for details on setting up 
monitoring and using these dashboards.
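+
+The sketch below shows one way to enable statistics collection in `ozone-site.xml`; `ALL` is the most detailed level and carries the performance penalty noted above:
+
+```xml
+<!-- Illustrative ozone-site.xml snippet: expose detailed RocksDB statistics over JMX. -->
+<configuration>
+  <property>
+    <name>ozone.metastore.rocksdb.statistics</name>
+    <value>ALL</value>
+  </property>
+</configuration>
+```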
+
+## 7. Storage Sizing
+
+Properly sizing the storage for RocksDB instances is essential to prevent 
performance bottlenecks and out-of-disk errors. The requirements vary 
significantly for each Ozone component, and using dedicated, fast storage 
(SSDs) is highly recommended.
+
+**Ozone Manager (OM):**
+
+- **Baseline:** A minimum of **100 GB** should be reserved for the OM's 
RocksDB instance. The OM stores the entire namespace metadata (volumes, 
buckets, keys), so this is the most critical database in the cluster.
+- **With Snapshots:** Enabling Ozone Snapshots will substantially increase 
storage needs. Each snapshot preserves a view of the metadata, and the 
underlying data files (SSTs) cannot be deleted by compaction until a snapshot 
is removed. The exact requirement depends on the number of retained snapshots 
and the rate of change (creations/deletions) in the namespace. Monitor disk 
usage closely after enabling snapshots. For more details, refer to the [Ozone 
Snapshot Documentation](../../operat [...]
+
+**Storage Container Manager (SCM):**
+
+- SCM's metadata footprint (pipelines, containers, Datanode heartbeats) is 
much smaller than the OM's. A baseline of **20-50 GB** is typically sufficient 
for its RocksDB instance.
+
+**Datanode:**
+
+- The Datanode's RocksDB stores metadata for all containers and their blocks. 
Its size grows proportionally with the number of containers and blocks hosted 
on that Datanode.
+- **Rule of Thumb:** A good starting point is to reserve **0.1% to 0.5%** of 
the total data disk capacity for RocksDB metadata. For example, a Datanode with 
100 TB of data disks should reserve between 100 GB and 500 GB for its RocksDB 
metadata.
+- Workloads with many small files will result in a higher block count and will 
require space on the higher end of this range.

