Wei-Chiu Chuang created HDDS-15010:
--------------------------------------
Summary: [Docs] System Internals: Ozone Manager RocksDB Schema
Key: HDDS-15010
URL: https://issues.apache.org/jira/browse/HDDS-15010
Project: Apache Ozone
Issue Type: Task
Components: documentation
Reporter: Wei-Chiu Chuang
Ozone Manager RocksDB Schema
This document describes the internal RocksDB schema used by the Ozone Manager
(OM). The OM uses RocksDB to store all its metadata, including information
about volumes, buckets, keys, and snapshots.
Database Overview
- DB Name: om.db
- Location: Defined by ozone.om.db.dirs configuration.
- Backend: RocksDB with multiple Column Families (Tables).
Column Families (Tables)
The OM database is organized into several column families, categorized by
their function.
1. Hierarchy and Ownership Tables
These tables store the basic structure of the Ozone namespace.
┌─────────────┬────────────────────┬────────────────┬─────────────────────────────────────────────────────┐
│ Table Name  │ Key Format         │ Value Type     │ Description                                         │
├─────────────┼────────────────────┼────────────────┼─────────────────────────────────────────────────────┤
│ userTable   │ userName           │ UserVolumeInfo │ Maps a user to a list of volumes they own.          │
│ volumeTable │ /{volume}          │ OmVolumeArgs   │ Stores volume-level metadata (owner, quota, ACLs).  │
│ bucketTable │ /{volume}/{bucket} │ OmBucketInfo   │ Stores bucket-level metadata (layout, quota, ACLs). │
└─────────────┴────────────────────┴────────────────┴─────────────────────────────────────────────────────┘
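As a rough illustration of the key formats in this table, the following Python sketch builds volumeTable and bucketTable keys. The helper names (volume_key, bucket_key) and the sample volume and bucket names are hypothetical and exist only for this example; the only detail taken from the document is the leading slash and the /{volume}/{bucket} layout.

```python
# Minimal sketch of the hierarchy key formats described above.
# Helper names are hypothetical; they are not part of the Ozone code base.

OM_KEY_PREFIX = "/"  # the document attributes this constant to OzoneConsts.OM_KEY_PREFIX


def volume_key(volume: str) -> str:
    """Build a volumeTable key: /{volume}."""
    return OM_KEY_PREFIX + volume


def bucket_key(volume: str, bucket: str) -> str:
    """Build a bucketTable key: /{volume}/{bucket}."""
    return volume_key(volume) + OM_KEY_PREFIX + bucket


# userTable is keyed by the bare user name, so it needs no helper.
# Plain dicts stand in for the three column families here.
user_table = {"alice": ["vol1"]}
volume_table = {volume_key("vol1"): {"owner": "alice"}}
bucket_table = {bucket_key("vol1", "b1"): {"layout": "OBJECT_STORE"}}

# Walk from a user to the volumes they own.
owned = [volume_table[volume_key(v)] for v in user_table["alice"]]
```

Because a bucket key starts with its volume key followed by a slash, all buckets of one volume sit next to each other in sorted key order.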
2. Object Store (OBS) Tables
Used for buckets with LEGACY or OBJECT_STORE layouts. Keys are referenced by
their full path names.
┌──────────────┬─────────────────────────────────────┬───────────────────┬──────────────────────────────────────────────────────────┐
│ Table Name   │ Key Format                          │ Value Type        │ Description                                              │
├──────────────┼─────────────────────────────────────┼───────────────────┼──────────────────────────────────────────────────────────┤
│ keyTable     │ /{volume}/{bucket}/{key}            │ OmKeyInfo         │ Metadata for committed keys, including block locations.  │
│ openKeyTable │ /{volume}/{bucket}/{key}/{clientId} │ OmKeyInfo         │ Metadata for keys currently being written (uncommitted). │
│ deletedTable │ /{volume}/{bucket}/{key}            │ RepeatedOmKeyInfo │ Keys marked for deletion, pending garbage collection.    │
└──────────────┴─────────────────────────────────────┴───────────────────┴──────────────────────────────────────────────────────────┘
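Since OBS keys are full path strings, listing the keys under a prefix can be pictured as a seek to the prefix followed by a forward scan over sorted keys. The sketch below imitates that with a sorted Python list. The function names and sample keys are hypothetical, and it assumes ASCII names so that Python string order matches byte order.

```python
from bisect import bisect_left

# Sketch only: a sorted list stands in for keyTable. Names are hypothetical.


def obs_key(volume: str, bucket: str, key: str) -> str:
    """Build a keyTable key: /{volume}/{bucket}/{key}."""
    return f"/{volume}/{bucket}/{key}"


def open_key(volume: str, bucket: str, key: str, client_id: int) -> str:
    """Build an openKeyTable key: /{volume}/{bucket}/{key}/{clientId}."""
    return f"{obs_key(volume, bucket, key)}/{client_id}"


def list_keys(sorted_keys, volume, bucket, prefix=""):
    """Seek to the first key >= the prefix, then scan while the prefix matches."""
    start = obs_key(volume, bucket, prefix)
    i = bisect_left(sorted_keys, start)
    matches = []
    while i < len(sorted_keys) and sorted_keys[i].startswith(start):
        matches.append(sorted_keys[i])
        i += 1
    return matches


key_table = sorted([
    "/vol1/b1/a/x.txt",
    "/vol1/b1/a/y.txt",
    "/vol1/b1/b.txt",
    "/vol1/b2/a/z.txt",
])

under_a = list_keys(key_table, "vol1", "b1", "a/")
```

In this model, renaming the "directory" a/ would mean rewriting every key that starts with /vol1/b1/a/, which is the cost the FSO layout is meant to avoid.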
3. File System Optimized (FSO) Tables
Used for buckets with FILE_SYSTEM_OPTIMIZED layout. These tables use a
hierarchical ID-based structure for better performance on directory operations
(like ls and rename).
┌───────────────────────┬────────────────────────────────────────────────────┬─────────────────┬─────────────────────────────────────────────┐
│ Table Name            │ Key Format                                         │ Value Type      │ Description                                 │
├───────────────────────┼────────────────────────────────────────────────────┼─────────────────┼─────────────────────────────────────────────┤
│ directoryTable        │ /{volId}/{buckId}/{parentId}/{dirName}             │ OmDirectoryInfo │ Metadata for directories.                   │
│ fileTable             │ /{volId}/{buckId}/{parentId}/{fileName}            │ OmKeyInfo       │ Metadata for committed files.               │
│ openFileTable         │ /{volId}/{buckId}/{parentId}/{fileName}/{clientId} │ OmKeyInfo       │ Metadata for files currently being written. │
│ deletedDirectoryTable │ /{volId}/{buckId}/{parentId}/{dirName}/{objId}     │ OmKeyInfo       │ Directories marked for deletion.            │
└───────────────────────┴────────────────────────────────────────────────────┴─────────────────┴─────────────────────────────────────────────┘
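To make the ID-based structure concrete, here is a small Python sketch that resolves a file path by walking directoryTable one component at a time, and then renames a directory by touching a single entry. Everything in it is illustrative: the dicts stand in for column families, the object IDs are made up, the directory values are reduced to bare object IDs, and it assumes that top-level entries use the bucket's object ID as their parentId.

```python
# Sketch only: dicts stand in for directoryTable and fileTable.
# Volume ID 1, bucket ID 2, and directory IDs 10 and 11 are made-up values.


def fso_key(vol_id: int, bucket_id: int, parent_id: int, name: str) -> str:
    """Build an FSO key: /{volId}/{buckId}/{parentId}/{name}."""
    return f"/{vol_id}/{bucket_id}/{parent_id}/{name}"


# Value here is just the directory's own object ID (a simplification).
directory_table = {
    fso_key(1, 2, 2, "docs"): 10,      # /docs, child of the bucket (assumed parentId)
    fso_key(1, 2, 10, "reports"): 11,  # /docs/reports, child of directory 10
}
file_table = {fso_key(1, 2, 11, "q1.csv"): "metadata for q1.csv"}


def resolve_file(vol_id: int, bucket_id: int, path: str) -> str:
    """Follow the parentId chain and return the fileTable key for path."""
    *dirs, file_name = path.strip("/").split("/")
    parent_id = bucket_id  # assumption: the bucket is the root parent
    for name in dirs:
        parent_id = directory_table[fso_key(vol_id, bucket_id, parent_id, name)]
    return fso_key(vol_id, bucket_id, parent_id, file_name)


def rename_dir(vol_id, bucket_id, parent_id, old_name, new_name):
    """Rename a directory by rewriting only its own directoryTable entry."""
    old = fso_key(vol_id, bucket_id, parent_id, old_name)
    directory_table[fso_key(vol_id, bucket_id, parent_id, new_name)] = directory_table.pop(old)


before = resolve_file(1, 2, "docs/reports/q1.csv")
rename_dir(1, 2, 2, "docs", "archive")
after = resolve_file(1, 2, "archive/reports/q1.csv")
```

The file's key is the same before and after the rename because it refers to its parent by ID rather than by name, so no descendant entries need to be rewritten.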
4. Multipart Upload (MPU) Tables
Stores metadata for S3-style multipart uploads.
┌─────────────────────┬─────────────────────────────────────┬─────────────────────┬─────────────────────────────────────────┐
│ Table Name          │ Key Format                          │ Value Type          │ Description                             │
├─────────────────────┼─────────────────────────────────────┼─────────────────────┼─────────────────────────────────────────┤
│ multipartInfoTable  │ /{volume}/{bucket}/{key}/{uploadId} │ OmMultipartKeyInfo  │ Overall MPU session metadata.           │
│ multipartPartsTable │ {uploadId}/{partNumber}             │ OmMultipartPartInfo │ Metadata for individual uploaded parts. │
└─────────────────────┴─────────────────────────────────────┴─────────────────────┴─────────────────────────────────────────┘
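The two MPU key formats can be sketched as follows. This is only an illustration of how the keys in the table relate to each other: the helper names, the upload IDs, and the dict standing in for multipartPartsTable are all hypothetical. The sketch sorts part numbers numerically after collecting them, since plain string order would place part 10 before part 2 if part numbers are stored unpadded.

```python
# Sketch only: helper names and upload IDs are hypothetical.


def multipart_info_key(volume: str, bucket: str, key: str, upload_id: str) -> str:
    """Build a multipartInfoTable key: /{volume}/{bucket}/{key}/{uploadId}."""
    return f"/{volume}/{bucket}/{key}/{upload_id}"


def multipart_part_key(upload_id: str, part_number: int) -> str:
    """Build a multipartPartsTable key: {uploadId}/{partNumber}."""
    return f"{upload_id}/{part_number}"


def parts_of(parts_table: dict, upload_id: str) -> list:
    """Collect the part numbers recorded for one upload, in numeric order."""
    prefix = f"{upload_id}/"
    numbers = [int(k[len(prefix):]) for k in parts_table if k.startswith(prefix)]
    return sorted(numbers)


parts_table = {multipart_part_key("upload-a", n): f"part {n}" for n in (10, 1, 2)}
parts_table[multipart_part_key("upload-b", 1)] = "part 1"

session_key = multipart_info_key("vol1", "b1", "big.bin", "upload-a")
uploaded = parts_of(parts_table, "upload-a")
```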
5. Snapshot Tables
Supports Ozone Snapshots and associated garbage collection.
┌──────────────────────┬───────────────────────────────────┬────────────────────┬───────────────────────────────────────────────────────────┐
│ Table Name           │ Key Format                        │ Value Type         │ Description                                               │
├──────────────────────┼───────────────────────────────────┼────────────────────┼───────────────────────────────────────────────────────────┤
│ snapshotInfoTable    │ /{volume}/{bucket}/{snapshotName} │ SnapshotInfo       │ Metadata for a specific snapshot.                         │
│ snapshotRenamedTable │ /{volName}/{buckName}/{objectId}  │ String             │ Tracks renamed objects between snapshots for correct GC.  │
│ compactionLogTable   │ {dbTrxId}-{compactionTime}        │ CompactionLogEntry │ History of RocksDB compactions used by snapshot services. │
└──────────────────────┴───────────────────────────────────┴────────────────────┴───────────────────────────────────────────────────────────┘
6. Multi-Tenant and Security Tables
Manages S3 secrets and multi-tenancy state.
┌───────────────────────────┬───────────────┬───────────────────────┬────────────────────────────────────────────────────┐
│ Table Name                │ Key Format    │ Value Type            │ Description                                        │
├───────────────────────────┼───────────────┼───────────────────────┼────────────────────────────────────────────────────┤
│ tenantStateTable          │ tenantId      │ OmDBTenantState       │ Tenant configuration and state.                    │
│ tenantAccessIdTable       │ accessId      │ OmDBAccessIdInfo      │ Maps access ID to secret and tenant.               │
│ principalToAccessIdsTable │ userPrincipal │ OmDBUserPrincipalInfo │ Maps a Kerberos principal to a list of access IDs. │
│ s3SecretTable             │ accessKeyId   │ S3SecretValue         │ Stores S3 secrets for users.                       │
│ dTokenTable               │ OzoneTokenID  │ Long                  │ Delegation tokens and their renewal times.         │
└───────────────────────────┴───────────────┴───────────────────────┴────────────────────────────────────────────────────┘
7. Administrative and System Tables
┌──────────────────────┬──────────────────┬─────────────────┬────────────────────────────────────────────────────────────────┐
│ Table Name           │ Key Format       │ Value Type      │ Description                                                    │
├──────────────────────┼──────────────────┼─────────────────┼────────────────────────────────────────────────────────────────┤
│ prefixTable          │ prefix           │ OmPrefixInfo    │ Prefix-level ACLs and metadata.                                │
│ transactionInfoTable │ #TRANSACTIONINFO │ TransactionInfo │ Stores the last applied Ratis transaction ID and term.         │
│ metaTable            │ metaDataKey      │ String          │ Miscellaneous system metadata (e.g., database layout version). │
└──────────────────────┴──────────────────┴─────────────────┴────────────────────────────────────────────────────────────────┘
Key Concepts
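- FSO vs. OBS: The primary difference is how paths are stored. OBS keys
concatenate the volume, bucket, and key names into one full-path string,
while FSO keys reference the parent directory by its object ID (parentId),
so a path is resolved by following a chain of IDs.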
- Object ID: A unique 64-bit identifier assigned to every object (volume,
bucket, key, directory). It is used as the parentId in FSO tables.
- OM Epoch: The most significant bits of Object IDs are often reserved for
an epoch to ensure uniqueness across OM restarts or migrations.
- Prefixes: Most keys in the hierarchy tables start with a leading slash (/)
as defined by OzoneConsts.OM_KEY_PREFIX.
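The Object ID and OM Epoch bullets can be pictured with a small bit-packing sketch. The document only says that the most significant bits are reserved for an epoch; the 2-bit epoch width, the function names, and the use of a simple counter for the remaining bits are assumptions made for this example, not a statement of the actual Ozone layout.

```python
# Sketch only: a 64-bit ID with an epoch in the top bits.
# EPOCH_BITS = 2 is an assumption for illustration.

EPOCH_BITS = 2
EPOCH_SHIFT = 64 - EPOCH_BITS


def make_object_id(epoch: int, counter: int) -> int:
    """Pack an epoch into the most significant bits above a counter."""
    if not 0 <= epoch < (1 << EPOCH_BITS):
        raise ValueError("epoch does not fit in the reserved bits")
    if not 0 <= counter < (1 << EPOCH_SHIFT):
        raise ValueError("counter overflows into the epoch bits")
    return (epoch << EPOCH_SHIFT) | counter


def epoch_of(object_id: int) -> int:
    """Recover the epoch from the most significant bits."""
    return object_id >> EPOCH_SHIFT


# The same counter value under two epochs yields two distinct IDs.
id_old = make_object_id(epoch=1, counter=5)
id_new = make_object_id(epoch=2, counter=5)
```

Under this scheme, a counter that restarts after a migration cannot collide with IDs issued earlier, as long as the epoch is bumped.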