[
https://issues.apache.org/jira/browse/HDDS-15010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated HDDS-15010:
----------------------------------
Labels: pull-request-available (was: )
> [Docs] System Internals: Ozone Manager RocksDB Schema
> -----------------------------------------------------
>
> Key: HDDS-15010
> URL: https://issues.apache.org/jira/browse/HDDS-15010
> Project: Apache Ozone
> Issue Type: Task
> Components: documentation
> Reporter: Wei-Chiu Chuang
> Assignee: Wei-Chiu Chuang
> Priority: Major
> Labels: pull-request-available
>
> Ozone Manager RocksDB Schema
> This document describes the internal RocksDB schema used by the Ozone
> Manager (OM). The OM uses RocksDB to store all its metadata, including
> information about volumes, buckets, keys, and snapshots.
> Database Overview
> - DB Name: om.db
> - Location: Defined by the ozone.om.db.dirs configuration property.
> - Backend: RocksDB with multiple Column Families (Tables).
> Column Families (Tables)
> The OM database is organized into several column families, categorized by
> their function.
> 1. Hierarchy and Ownership Tables
> These tables store the basic structure of the Ozone namespace.
>
> ┌─────────────┬────────────────────┬────────────────┬─────────────────────────────────────────────────────┐
> │ Table Name  │ Key Format         │ Value Type     │ Description                                         │
> ├─────────────┼────────────────────┼────────────────┼─────────────────────────────────────────────────────┤
> │ userTable   │ userName           │ UserVolumeInfo │ Maps a user to a list of volumes they own.          │
> │ volumeTable │ /{volume}          │ OmVolumeArgs   │ Stores volume-level metadata (owner, quota, ACLs).  │
> │ bucketTable │ /{volume}/{bucket} │ OmBucketInfo   │ Stores bucket-level metadata (layout, quota, ACLs). │
> └─────────────┴────────────────────┴────────────────┴─────────────────────────────────────────────────────┘
> 2. Object Store (OBS) Tables
> Used for buckets with LEGACY or OBJECT_STORE layouts. Keys are referenced
> by their full path names.
>
> ┌──────────────┬─────────────────────────────────────┬───────────────────┬──────────────────────────────────────────────────────────┐
> │ Table Name   │ Key Format                          │ Value Type        │ Description                                              │
> ├──────────────┼─────────────────────────────────────┼───────────────────┼──────────────────────────────────────────────────────────┤
> │ keyTable     │ /{volume}/{bucket}/{key}            │ OmKeyInfo         │ Metadata for committed keys, including block locations.  │
> │ openKeyTable │ /{volume}/{bucket}/{key}/{clientId} │ OmKeyInfo         │ Metadata for keys currently being written (uncommitted). │
> │ deletedTable │ /{volume}/{bucket}/{key}            │ RepeatedOmKeyInfo │ Keys marked for deletion, pending garbage collection.    │
> └──────────────┴─────────────────────────────────────┴───────────────────┴──────────────────────────────────────────────────────────┘
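
The key formats listed so far are plain string concatenations, which can be sketched as below. This is only an illustration: the helper names are hypothetical, and treating clientId as an integer is an assumption; only the key layouts themselves come from the tables above.

```python
# Hypothetical helpers; only the key layouts come from the tables above.
OM_KEY_PREFIX = "/"  # leading separator, named after OzoneConsts.OM_KEY_PREFIX


def volume_key(volume: str) -> str:
    """volumeTable key: /{volume}"""
    return OM_KEY_PREFIX + volume


def bucket_key(volume: str, bucket: str) -> str:
    """bucketTable key: /{volume}/{bucket}"""
    return volume_key(volume) + OM_KEY_PREFIX + bucket


def obs_key(volume: str, bucket: str, key: str) -> str:
    """keyTable / deletedTable key: /{volume}/{bucket}/{key}"""
    return bucket_key(volume, bucket) + OM_KEY_PREFIX + key


def open_obs_key(volume: str, bucket: str, key: str, client_id: int) -> str:
    """openKeyTable key: /{volume}/{bucket}/{key}/{clientId}"""
    # Assumption: the client ID is rendered as a decimal string.
    return obs_key(volume, bucket, key) + OM_KEY_PREFIX + str(client_id)


print(obs_key("vol1", "bucket1", "dir/a.txt"))
print(open_obs_key("vol1", "bucket1", "dir/a.txt", 1234))
```

Because the full key name is embedded in the row key, a key such as dir/a.txt carries its whole path with it; there is no separate directory entry in this layout.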
> 3. File System Optimized (FSO) Tables
> Used for buckets with FILE_SYSTEM_OPTIMIZED layout. These tables use a
> hierarchical ID-based structure for better performance on directory operations
> (like ls and rename).
>
> ┌───────────────────────┬────────────────────────────────────────────────────┬─────────────────┬─────────────────────────────────────────────┐
> │ Table Name            │ Key Format                                         │ Value Type      │ Description                                 │
> ├───────────────────────┼────────────────────────────────────────────────────┼─────────────────┼─────────────────────────────────────────────┤
> │ directoryTable        │ /{volId}/{buckId}/{parentId}/{dirName}             │ OmDirectoryInfo │ Metadata for directories.                   │
> │ fileTable             │ /{volId}/{buckId}/{parentId}/{fileName}            │ OmKeyInfo       │ Metadata for committed files.               │
> │ openFileTable         │ /{volId}/{buckId}/{parentId}/{fileName}/{clientId} │ OmKeyInfo       │ Metadata for files currently being written. │
> │ deletedDirectoryTable │ /{volId}/{buckId}/{parentId}/{dirName}/{objId}     │ OmKeyInfo       │ Directories marked for deletion.            │
> └───────────────────────┴────────────────────────────────────────────────────┴─────────────────┴─────────────────────────────────────────────┘
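
To illustrate why the ID-based layout helps with rename, here is a minimal sketch that models directoryTable and fileTable as in-memory dictionaries. The function names, the sample IDs, and the choice of the bucket's ID as the parentId of top-level entries are assumptions made for the example, not details confirmed by the description.

```python
def fso_key(vol_id: int, bucket_id: int, parent_id: int, name: str) -> str:
    """FSO key layout: /{volId}/{buckId}/{parentId}/{name}"""
    return f"/{vol_id}/{bucket_id}/{parent_id}/{name}"


def resolve_file(directory_table, file_table, vol_id, bucket_id, path):
    """Walk the path one component at a time, following parent IDs."""
    *dir_names, file_name = path.strip("/").split("/")
    parent_id = bucket_id  # assumption: top-level entries hang off the bucket ID
    for dir_name in dir_names:
        entry = directory_table.get(fso_key(vol_id, bucket_id, parent_id, dir_name))
        if entry is None:
            return None
        parent_id = entry["objectId"]
    return file_table.get(fso_key(vol_id, bucket_id, parent_id, file_name))


def rename_dir(directory_table, vol_id, bucket_id, parent_id, old_name, new_name):
    """Rename a directory by rewriting a single directoryTable row."""
    entry = directory_table.pop(fso_key(vol_id, bucket_id, parent_id, old_name))
    directory_table[fso_key(vol_id, bucket_id, parent_id, new_name)] = entry


# Sample data with made-up IDs: volume 1, bucket 2, directories a (10) and b (11).
VOL_ID, BUCKET_ID = 1, 2
directory_table = {
    "/1/2/2/a": {"objectId": 10},
    "/1/2/10/b": {"objectId": 11},
}
file_table = {"/1/2/11/file.txt": {"objectId": 12}}

before = resolve_file(directory_table, file_table, VOL_ID, BUCKET_ID, "a/b/file.txt")
rename_dir(directory_table, VOL_ID, BUCKET_ID, BUCKET_ID, "a", "z")
after = resolve_file(directory_table, file_table, VOL_ID, BUCKET_ID, "z/b/file.txt")
```

In this model the rename touches one row, and the rows for b and file.txt stay untouched because they refer to their parent by ID rather than by name. With full-path keys, every descendant key would need to be rewritten.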
> 4. Multipart Upload (MPU) Tables
> Stores metadata for S3-style multipart uploads.
>
> ┌─────────────────────┬─────────────────────────────────────┬─────────────────────┬─────────────────────────────────────────┐
> │ Table Name          │ Key Format                          │ Value Type          │ Description                             │
> ├─────────────────────┼─────────────────────────────────────┼─────────────────────┼─────────────────────────────────────────┤
> │ multipartInfoTable  │ /{volume}/{bucket}/{key}/{uploadId} │ OmMultipartKeyInfo  │ Overall MPU session metadata.           │
> │ multipartPartsTable │ {uploadId}/{partNumber}             │ OmMultipartPartInfo │ Metadata for individual uploaded parts. │
> └─────────────────────┴─────────────────────────────────────┴─────────────────────┴─────────────────────────────────────────┘
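
Since RocksDB keeps keys sorted bytewise by default, a key that starts with {uploadId}/ lets all parts of one upload be found with a prefix scan. The sketch below imitates that with a sorted Python list; the helper names and the sample upload IDs are hypothetical, and rendering partNumber as a plain decimal string is an assumption.

```python
from bisect import bisect_left


def part_key(upload_id: str, part_number: int) -> str:
    """multipartPartsTable key layout: {uploadId}/{partNumber}"""
    # Assumption: the part number is written as a plain decimal string.
    return f"{upload_id}/{part_number}"


def scan_prefix(sorted_keys, prefix):
    """Return all keys starting with prefix from a sorted key list."""
    start = bisect_left(sorted_keys, prefix)  # seek to the first candidate
    matches = []
    for key in sorted_keys[start:]:
        if not key.startswith(prefix):
            break  # sorted order: no later key can match
        matches.append(key)
    return matches


keys = sorted(
    part_key(upload_id, number)
    for upload_id, number in [("upload-A", 1), ("upload-A", 2),
                              ("upload-A", 10), ("upload-B", 1)]
)
parts_a = scan_prefix(keys, "upload-A/")
print(parts_a)
```

Under the decimal-string assumption, the scan returns parts in lexicographic order (1, 10, 2), so a reader of such a table would sort by the numeric part number afterwards.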
> 5. Snapshot Tables
> Supports Ozone Snapshots and associated garbage collection.
>
> ┌──────────────────────┬───────────────────────────────────┬────────────────────┬───────────────────────────────────────────────────────────┐
> │ Table Name           │ Key Format                        │ Value Type         │ Description                                               │
> ├──────────────────────┼───────────────────────────────────┼────────────────────┼───────────────────────────────────────────────────────────┤
> │ snapshotInfoTable    │ /{volume}/{bucket}/{snapshotName} │ SnapshotInfo       │ Metadata for a specific snapshot.                         │
> │ snapshotRenamedTable │ /{volName}/{buckName}/{objectId}  │ String             │ Tracks renamed objects between snapshots for correct GC.  │
> │ compactionLogTable   │ {dbTrxId}-{compactionTime}        │ CompactionLogEntry │ History of RocksDB compactions used by snapshot services. │
> └──────────────────────┴───────────────────────────────────┴────────────────────┴───────────────────────────────────────────────────────────┘
> 6. Multi-Tenant and Security Tables
> Manages S3 secrets and multi-tenancy state.
>
> ┌───────────────────────────┬───────────────┬───────────────────────┬────────────────────────────────────────────────────┐
> │ Table Name                │ Key Format    │ Value Type            │ Description                                        │
> ├───────────────────────────┼───────────────┼───────────────────────┼────────────────────────────────────────────────────┤
> │ tenantStateTable          │ tenantId      │ OmDBTenantState       │ Tenant configuration and state.                    │
> │ tenantAccessIdTable       │ accessId      │ OmDBAccessIdInfo      │ Maps access ID to secret and tenant.               │
> │ principalToAccessIdsTable │ userPrincipal │ OmDBUserPrincipalInfo │ Maps a Kerberos principal to a list of access IDs. │
> │ s3SecretTable             │ accessKeyId   │ S3SecretValue         │ Stores S3 secrets for users.                       │
> │ dTokenTable               │ OzoneTokenID  │ Long                  │ Delegation tokens and their renewal times.         │
> └───────────────────────────┴───────────────┴───────────────────────┴────────────────────────────────────────────────────┘
> 7. Administrative and System Tables
>
> ┌──────────────────────┬──────────────────┬─────────────────┬────────────────────────────────────────────────────────────────┐
> │ Table Name           │ Key Format       │ Value Type      │ Description                                                    │
> ├──────────────────────┼──────────────────┼─────────────────┼────────────────────────────────────────────────────────────────┤
> │ prefixTable          │ prefix           │ OmPrefixInfo    │ Prefix-level ACLs and metadata.                                │
> │ transactionInfoTable │ #TRANSACTIONINFO │ TransactionInfo │ Stores the last applied Ratis transaction ID and term.         │
> │ metaTable            │ metaDataKey      │ String          │ Miscellaneous system metadata (e.g., database layout version). │
> └──────────────────────┴──────────────────┴─────────────────┴────────────────────────────────────────────────────────────────┘
> Key Concepts
> - FSO vs. OBS: The primary difference is how paths are stored. OBS uses
> string concatenation of names, while FSO uses a chain of IDs (parentId).
> - Object ID: A unique 64-bit identifier assigned to every object (volume,
> bucket, key, directory). It is used as the parentId in FSO tables.
> - OM Epoch: The most significant bits of Object IDs are often reserved for
> an epoch to ensure uniqueness across OM restarts or migrations.
> - Prefixes: Most keys in the hierarchy tables start with a leading slash
> (/) as defined by OzoneConsts.OM_KEY_PREFIX.
>
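
The epoch idea from the Key Concepts list can be sketched as simple bit packing. The description does not say how many bits the epoch occupies, so the 2-bit width below is an assumption chosen for illustration, and the function names are hypothetical.

```python
ID_BITS = 64
EPOCH_BITS = 2  # assumption for illustration; the description gives no width
COUNTER_BITS = ID_BITS - EPOCH_BITS


def make_object_id(epoch: int, counter: int) -> int:
    """Pack an epoch into the most significant bits of a 64-bit ID."""
    if not 0 <= epoch < (1 << EPOCH_BITS):
        raise ValueError("epoch does not fit in the reserved bits")
    if not 0 <= counter < (1 << COUNTER_BITS):
        raise ValueError("counter does not fit in the remaining bits")
    return (epoch << COUNTER_BITS) | counter


def split_object_id(object_id: int) -> tuple:
    """Recover (epoch, counter) from a packed ID."""
    return object_id >> COUNTER_BITS, object_id & ((1 << COUNTER_BITS) - 1)


# The same counter value under two epochs yields two distinct IDs.
id_old = make_object_id(1, 5)
id_new = make_object_id(2, 5)
print(id_old != id_new, split_object_id(id_new))
```

If the counter restarts after a restart or migration, bumping the epoch keeps newly issued IDs from colliding with IDs handed out earlier.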
--
This message was sent by Atlassian Jira
(v8.20.10#820010)