[ 
https://issues.apache.org/jira/browse/HDDS-15010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HDDS-15010:
----------------------------------
    Labels: pull-request-available  (was: )

> [Docs] System Internals: Ozone Manager RocksDB Schema
> -----------------------------------------------------
>
>                 Key: HDDS-15010
>                 URL: https://issues.apache.org/jira/browse/HDDS-15010
>             Project: Apache Ozone
>          Issue Type: Task
>          Components: documentation
>            Reporter: Wei-Chiu Chuang
>            Assignee: Wei-Chiu Chuang
>            Priority: Major
>              Labels: pull-request-available
>
> ✦ Ozone Manager RocksDB Schema
>   This document describes the internal RocksDB schema used by the Ozone
>   Manager (OM). The OM uses RocksDB to store all its metadata, including
>   information about volumes, buckets, keys, and snapshots.
>   Database Overview
>    - DB Name: om.db
>    - Location: Defined by the ozone.om.db.dirs configuration property.
>    - Backend: RocksDB with multiple Column Families (Tables).
>   Column Families (Tables)
>   The OM database is organized into several column families, categorized by
>   their function.
>   1. Hierarchy and Ownership Tables
>   These tables store the basic structure of the Ozone namespace.
>   ┌─────────────┬────────────────────┬────────────────┬─────────────────────────────────────────────────────┐
>   │ Table Name  │ Key Format         │ Value Type     │ Description                                         │
>   ├─────────────┼────────────────────┼────────────────┼─────────────────────────────────────────────────────┤
>   │ userTable   │ userName           │ UserVolumeInfo │ Maps a user to a list of volumes they own.          │
>   │ volumeTable │ /{volume}          │ OmVolumeArgs   │ Stores volume-level metadata (owner, quota, ACLs).  │
>   │ bucketTable │ /{volume}/{bucket} │ OmBucketInfo   │ Stores bucket-level metadata (layout, quota, ACLs). │
>   └─────────────┴────────────────────┴────────────────┴─────────────────────────────────────────────────────┘
>   2. Object Store (OBS) Tables
>   Used for buckets with LEGACY or OBJECT_STORE layouts. Keys are referenced
>   by their full path names.
>   ┌──────────────┬─────────────────────────────────────┬───────────────────┬──────────────────────────────────────────────────────────┐
>   │ Table Name   │ Key Format                          │ Value Type        │ Description                                              │
>   ├──────────────┼─────────────────────────────────────┼───────────────────┼──────────────────────────────────────────────────────────┤
>   │ keyTable     │ /{volume}/{bucket}/{key}            │ OmKeyInfo         │ Metadata for committed keys, including block locations.  │
>   │ openKeyTable │ /{volume}/{bucket}/{key}/{clientId} │ OmKeyInfo         │ Metadata for keys currently being written (uncommitted). │
>   │ deletedTable │ /{volume}/{bucket}/{key}            │ RepeatedOmKeyInfo │ Keys marked for deletion, pending garbage collection.    │
>   └──────────────┴─────────────────────────────────────┴───────────────────┴──────────────────────────────────────────────────────────┘
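The OBS key formats above can be sketched in a few lines of Python. This is an illustration only, not Ozone code: the helper names (obs_key, obs_open_key, list_by_prefix) are hypothetical, and a sorted Python list stands in for RocksDB's ordered key space. It shows why full-path keys make "list everything under a prefix" a simple range scan.

```python
import bisect

OM_KEY_PREFIX = "/"  # the leading slash the text attributes to OzoneConsts.OM_KEY_PREFIX


def obs_key(volume: str, bucket: str, key: str) -> str:
    """Build a keyTable / deletedTable style key: /{volume}/{bucket}/{key}."""
    return OM_KEY_PREFIX + OM_KEY_PREFIX.join((volume, bucket, key))


def obs_open_key(volume: str, bucket: str, key: str, client_id: int) -> str:
    """Build an openKeyTable style key: /{volume}/{bucket}/{key}/{clientId}."""
    return obs_key(volume, bucket, key) + OM_KEY_PREFIX + str(client_id)


def list_by_prefix(sorted_keys: list[str], prefix: str) -> list[str]:
    """Emulate a prefix scan over keys that are kept in sorted order."""
    start = bisect.bisect_left(sorted_keys, prefix)
    matches = []
    for k in sorted_keys[start:]:
        if not k.startswith(prefix):
            break  # sorted order: once the prefix stops matching, we are done
        matches.append(k)
    return matches


# Hypothetical sample data; a sorted list stands in for the keyTable.
table = sorted(obs_key("vol1", "b1", k) for k in ("a/x.txt", "a/y.txt", "b/z.txt"))
print(list_by_prefix(table, obs_key("vol1", "b1", "a/")))
```

Because the whole path is part of the key, renaming a "directory" in this layout means rewriting every key that shares the prefix, which is the cost the FSO layout in the next section is designed to avoid.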
>   3. File System Optimized (FSO) Tables
>   Used for buckets with FILE_SYSTEM_OPTIMIZED layout. These tables use a
>   hierarchical ID-based structure for better performance on directory
>   operations (like ls and rename).
>   ┌───────────────────────┬────────────────────────────────────────────────────┬─────────────────┬─────────────────────────────────────────────┐
>   │ Table Name            │ Key Format                                         │ Value Type      │ Description                                 │
>   ├───────────────────────┼────────────────────────────────────────────────────┼─────────────────┼─────────────────────────────────────────────┤
>   │ directoryTable        │ /{volId}/{buckId}/{parentId}/{dirName}             │ OmDirectoryInfo │ Metadata for directories.                   │
>   │ fileTable             │ /{volId}/{buckId}/{parentId}/{fileName}            │ OmKeyInfo       │ Metadata for committed files.               │
>   │ openFileTable         │ /{volId}/{buckId}/{parentId}/{fileName}/{clientId} │ OmKeyInfo       │ Metadata for files currently being written. │
>   │ deletedDirectoryTable │ /{volId}/{buckId}/{parentId}/{dirName}/{objId}     │ OmKeyInfo       │ Directories marked for deletion.            │
>   └───────────────────────┴────────────────────────────────────────────────────┴─────────────────┴─────────────────────────────────────────────┘
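To make the ID-based layout concrete, here is a minimal Python sketch that models directoryTable and fileTable as plain dicts and resolves a path by following parentId links. The function names, the dict-based tables, and the sample IDs are hypothetical, and the sketch assumes that top-level entries use the bucket's object ID as their parentId.

```python
def fso_key(vol_id: int, bucket_id: int, parent_id: int, name: str) -> str:
    """Build a directoryTable / fileTable style key: /{volId}/{buckId}/{parentId}/{name}."""
    return f"/{vol_id}/{bucket_id}/{parent_id}/{name}"


def resolve_file(directory_table, file_table, vol_id, bucket_id, path):
    """Walk the path one component at a time, using each directory's ID as the next parentId."""
    parts = [p for p in path.split("/") if p]
    if not parts:
        return None
    parent_id = bucket_id  # assumption: the bucket's object ID is the root parentId
    for name in parts[:-1]:
        entry = directory_table.get(fso_key(vol_id, bucket_id, parent_id, name))
        if entry is None:
            return None
        parent_id = entry["objectId"]
    return file_table.get(fso_key(vol_id, bucket_id, parent_id, parts[-1]))


def rename_dir(directory_table, vol_id, bucket_id, parent_id, old, new):
    """Rename a directory by moving one entry; children still point at the same objectId."""
    entry = directory_table.pop(fso_key(vol_id, bucket_id, parent_id, old))
    directory_table[fso_key(vol_id, bucket_id, parent_id, new)] = entry


# Hypothetical sample data: /a/b/f.txt inside volume 1, bucket 2.
VOL, BUCKET = 1, 2
directory_table = {
    fso_key(VOL, BUCKET, BUCKET, "a"): {"objectId": 10},
    fso_key(VOL, BUCKET, 10, "b"): {"objectId": 11},
}
file_table = {fso_key(VOL, BUCKET, 11, "f.txt"): {"objectId": 12, "size": 3}}

rename_dir(directory_table, VOL, BUCKET, BUCKET, "a", "renamed")
found = resolve_file(directory_table, file_table, VOL, BUCKET, "renamed/b/f.txt")
print(found)
```

In this model the rename touches a single directoryTable entry, while the entries for b and f.txt are untouched because they reference their parent by ID rather than by name.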
>   4. Multipart Upload (MPU) Tables
>   Stores metadata for S3-style multipart uploads.
>   ┌─────────────────────┬─────────────────────────────────────┬─────────────────────┬─────────────────────────────────────────┐
>   │ Table Name          │ Key Format                          │ Value Type          │ Description                             │
>   ├─────────────────────┼─────────────────────────────────────┼─────────────────────┼─────────────────────────────────────────┤
>   │ multipartInfoTable  │ /{volume}/{bucket}/{key}/{uploadId} │ OmMultipartKeyInfo  │ Overall MPU session metadata.           │
>   │ multipartPartsTable │ {uploadId}/{partNumber}             │ OmMultipartPartInfo │ Metadata for individual uploaded parts. │
>   └─────────────────────┴─────────────────────────────────────┴─────────────────────┴─────────────────────────────────────────┘
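As a rough illustration of how the two MPU key formats relate, the Python sketch below builds both kinds of key and gathers the parts of one upload. The helper names and the dict standing in for multipartPartsTable are hypothetical, and the sketch formats part numbers as plain decimal strings, which is an assumption rather than a statement about how the OM encodes them.

```python
def mpu_info_key(volume: str, bucket: str, key: str, upload_id: str) -> str:
    """Build a multipartInfoTable style key: /{volume}/{bucket}/{key}/{uploadId}."""
    return f"/{volume}/{bucket}/{key}/{upload_id}"


def mpu_part_key(upload_id: str, part_number: int) -> str:
    """Build a multipartPartsTable style key: {uploadId}/{partNumber}."""
    return f"{upload_id}/{part_number}"


def parts_of(parts_table: dict, upload_id: str) -> list:
    """Collect the parts of one upload, ordered by numeric part number.

    Plain string order would place "10" before "2", so the part number
    is parsed back to an int before sorting.
    """
    prefix = upload_id + "/"
    numbered = [
        (int(k[len(prefix):]), v)
        for k, v in parts_table.items()
        if k.startswith(prefix)
    ]
    return [v for _, v in sorted(numbered, key=lambda pair: pair[0])]


# Hypothetical sample data: two uploads sharing one parts table.
parts_table = {
    mpu_part_key("u1", 10): {"size": 30},
    mpu_part_key("u1", 2): {"size": 20},
    mpu_part_key("u1", 1): {"size": 10},
    mpu_part_key("u2", 1): {"size": 99},
}
print(parts_of(parts_table, "u1"))
```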
>   5. Snapshot Tables
>   Supports Ozone Snapshots and associated garbage collection.
>   ┌──────────────────────┬───────────────────────────────────┬────────────────────┬───────────────────────────────────────────────────────────┐
>   │ Table Name           │ Key Format                        │ Value Type         │ Description                                               │
>   ├──────────────────────┼───────────────────────────────────┼────────────────────┼───────────────────────────────────────────────────────────┤
>   │ snapshotInfoTable    │ /{volume}/{bucket}/{snapshotName} │ SnapshotInfo       │ Metadata for a specific snapshot.                         │
>   │ snapshotRenamedTable │ /{volName}/{buckName}/{objectId}  │ String             │ Tracks renamed objects between snapshots for correct GC.  │
>   │ compactionLogTable   │ {dbTrxId}-{compactionTime}        │ CompactionLogEntry │ History of RocksDB compactions used by snapshot services. │
>   └──────────────────────┴───────────────────────────────────┴────────────────────┴───────────────────────────────────────────────────────────┘
>   6. Multi-Tenant and Security Tables
>   Manages S3 secrets and multi-tenancy state.
>   ┌───────────────────────────┬───────────────┬───────────────────────┬────────────────────────────────────────────────────┐
>   │ Table Name                │ Key Format    │ Value Type            │ Description                                        │
>   ├───────────────────────────┼───────────────┼───────────────────────┼────────────────────────────────────────────────────┤
>   │ tenantStateTable          │ tenantId      │ OmDBTenantState       │ Tenant configuration and state.                    │
>   │ tenantAccessIdTable       │ accessId      │ OmDBAccessIdInfo      │ Maps access ID to secret and tenant.               │
>   │ principalToAccessIdsTable │ userPrincipal │ OmDBUserPrincipalInfo │ Maps a Kerberos principal to a list of access IDs. │
>   │ s3SecretTable             │ accessKeyId   │ S3SecretValue         │ Stores S3 secrets for users.                       │
>   │ dTokenTable               │ OzoneTokenID  │ Long                  │ Delegation tokens and their renewal times.         │
>   └───────────────────────────┴───────────────┴───────────────────────┴────────────────────────────────────────────────────┘
>   7. Administrative and System Tables
>   ┌──────────────────────┬──────────────────┬─────────────────┬────────────────────────────────────────────────────────────────┐
>   │ Table Name           │ Key Format       │ Value Type      │ Description                                                    │
>   ├──────────────────────┼──────────────────┼─────────────────┼────────────────────────────────────────────────────────────────┤
>   │ prefixTable          │ prefix           │ OmPrefixInfo    │ Prefix-level ACLs and metadata.                                │
>   │ transactionInfoTable │ #TRANSACTIONINFO │ TransactionInfo │ Stores the last applied Ratis transaction ID and term.         │
>   │ metaTable            │ metaDataKey      │ String          │ Miscellaneous system metadata (e.g., database layout version). │
>   └──────────────────────┴──────────────────┴─────────────────┴────────────────────────────────────────────────────────────────┘
>   Key Concepts
>    - FSO vs. OBS: The primary difference is how paths are stored. OBS keys
>      are the concatenated volume, bucket, and key names, while FSO keys
>      reference the parent directory by its object ID (parentId), so a path
>      is resolved by following a chain of IDs.
>    - Object ID: A unique 64-bit identifier assigned to every object (volume,
>      bucket, key, directory). It is used as the parentId in FSO tables.
>    - OM Epoch: The most significant bits of Object IDs are often reserved for
>      an epoch to ensure uniqueness across OM restarts or migrations.
>    - Prefixes: Most keys in the hierarchy tables start with a leading slash
>      (/), as defined by OzoneConsts.OM_KEY_PREFIX.
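The epoch idea can be illustrated with a small bit-packing sketch in Python. The 2-bit epoch width, the constant names, and both helper functions are assumptions chosen for illustration; the description above only says that the most significant bits are often reserved, so the actual widths and encoding in the OM may differ.

```python
ID_BITS = 64
EPOCH_BITS = 2  # assumption for illustration; the real width is not stated above
COUNTER_BITS = ID_BITS - EPOCH_BITS
COUNTER_MASK = (1 << COUNTER_BITS) - 1


def make_object_id(epoch: int, counter: int) -> int:
    """Pack an epoch into the most significant bits of a 64-bit object ID."""
    if not 0 <= epoch < (1 << EPOCH_BITS):
        raise ValueError("epoch does not fit in the reserved bits")
    if not 0 <= counter <= COUNTER_MASK:
        raise ValueError("counter does not fit in the remaining bits")
    return (epoch << COUNTER_BITS) | counter


def split_object_id(object_id: int) -> tuple[int, int]:
    """Recover (epoch, counter) from a packed object ID."""
    return object_id >> COUNTER_BITS, object_id & COUNTER_MASK


# The same counter value under two epochs yields two distinct IDs.
id_before = make_object_id(0, 7)
id_after = make_object_id(1, 7)
print(id_before, id_after, split_object_id(id_after))
```

Under these assumptions, a counter that restarts from a low value after an epoch change still cannot collide with IDs issued earlier, which is the uniqueness property the OM Epoch bullet describes.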
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)