Wei-Chiu Chuang created HDDS-15010:
--------------------------------------

             Summary: [Docs] System Internals: Ozone Manager RocksDB Schema
                 Key: HDDS-15010
                 URL: https://issues.apache.org/jira/browse/HDDS-15010
             Project: Apache Ozone
          Issue Type: Task
          Components: documentation
            Reporter: Wei-Chiu Chuang


Ozone Manager RocksDB Schema

  This document describes the internal RocksDB schema used by the Ozone Manager
  (OM). The OM uses RocksDB to store all its metadata, including information
  about volumes, buckets, keys, and snapshots.

  Database Overview

   - DB Name: om.db
   - Location: Defined by the ozone.om.db.dirs configuration property.
   - Backend: RocksDB with multiple Column Families (Tables).

  Column Families (Tables)

  The OM database is organized into several column families, categorized by
  their function.

  1. Hierarchy and Ownership Tables
  These tables store the basic structure of the Ozone namespace.


  
  ┌─────────────┬────────────────────┬────────────────┬─────────────────────────────────────────────────────┐
  │ Table Name  │ Key Format         │ Value Type     │ Description                                         │
  ├─────────────┼────────────────────┼────────────────┼─────────────────────────────────────────────────────┤
  │ userTable   │ userName           │ UserVolumeInfo │ Maps a user to a list of volumes they own.          │
  │ volumeTable │ /{volume}          │ OmVolumeArgs   │ Stores volume-level metadata (owner, quota, ACLs).  │
  │ bucketTable │ /{volume}/{bucket} │ OmBucketInfo   │ Stores bucket-level metadata (layout, quota, ACLs). │
  └─────────────┴────────────────────┴────────────────┴─────────────────────────────────────────────────────┘


  2. Object Store (OBS) Tables
  Used for buckets with LEGACY or OBJECT_STORE layouts. Keys are referenced by
  their full path names.


  
  ┌──────────────┬─────────────────────────────────────┬───────────────────┬──────────────────────────────────────────────────────────┐
  │ Table Name   │ Key Format                          │ Value Type        │ Description                                              │
  ├──────────────┼─────────────────────────────────────┼───────────────────┼──────────────────────────────────────────────────────────┤
  │ keyTable     │ /{volume}/{bucket}/{key}            │ OmKeyInfo         │ Metadata for committed keys, including block locations.  │
  │ openKeyTable │ /{volume}/{bucket}/{key}/{clientId} │ OmKeyInfo         │ Metadata for keys currently being written (uncommitted). │
  │ deletedTable │ /{volume}/{bucket}/{key}            │ RepeatedOmKeyInfo │ Keys marked for deletion, pending garbage collection.    │
  └──────────────┴─────────────────────────────────────┴───────────────────┴──────────────────────────────────────────────────────────┘
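
  The key formats in the table above can be illustrated with a small sketch.
  It is not Ozone code: the helper names (obs_key, open_key, list_by_prefix)
  and the sample data are hypothetical, and a plain dict stands in for the
  RocksDB column family. It only shows how full-path keys are assembled and
  why listing a bucket amounts to a prefix match on those keys.

```python
# Illustrative only: a dict stands in for the keyTable column family.
OM_KEY_PREFIX = "/"  # leading slash, as named in the Key Concepts section


def obs_key(volume, bucket, key):
    """Build a keyTable-style key: /{volume}/{bucket}/{key}."""
    return f"{OM_KEY_PREFIX}{volume}{OM_KEY_PREFIX}{bucket}{OM_KEY_PREFIX}{key}"


def open_key(volume, bucket, key, client_id):
    """Build an openKeyTable-style key: /{volume}/{bucket}/{key}/{clientId}."""
    return f"{obs_key(volume, bucket, key)}{OM_KEY_PREFIX}{client_id}"


def list_by_prefix(table, volume, bucket, key_prefix=""):
    """Return, in sorted order, the table keys under a bucket and key prefix."""
    start = obs_key(volume, bucket, key_prefix)
    return sorted(k for k in table if k.startswith(start))


# Hypothetical sample data; values would be OmKeyInfo records in the real table.
key_table = {
    obs_key("vol1", "b1", "dir/a.txt"): "info-a",
    obs_key("vol1", "b1", "dir/b.txt"): "info-b",
    obs_key("vol1", "b2", "dir/a.txt"): "info-c",
}

print(list_by_prefix(key_table, "vol1", "b1", "dir/"))
```

  Because the key name is stored as one string, "dir/" here is only a naming
  convention inside the key, not a separate directory record.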


  3. File System Optimized (FSO) Tables
  Used for buckets with FILE_SYSTEM_OPTIMIZED layout. These tables use a
  hierarchical ID-based structure for better performance on directory operations
  (like ls and rename).


  
  ┌───────────────────────┬────────────────────────────────────────────────────┬─────────────────┬─────────────────────────────────────────────┐
  │ Table Name            │ Key Format                                         │ Value Type      │ Description                                 │
  ├───────────────────────┼────────────────────────────────────────────────────┼─────────────────┼─────────────────────────────────────────────┤
  │ directoryTable        │ /{volId}/{buckId}/{parentId}/{dirName}             │ OmDirectoryInfo │ Metadata for directories.                   │
  │ fileTable             │ /{volId}/{buckId}/{parentId}/{fileName}            │ OmKeyInfo       │ Metadata for committed files.               │
  │ openFileTable         │ /{volId}/{buckId}/{parentId}/{fileName}/{clientId} │ OmKeyInfo       │ Metadata for files currently being written. │
  │ deletedDirectoryTable │ /{volId}/{buckId}/{parentId}/{dirName}/{objId}     │ OmKeyInfo       │ Directories marked for deletion.            │
  └───────────────────────┴────────────────────────────────────────────────────┴─────────────────┴─────────────────────────────────────────────┘
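
  To see why the ID-based layout helps directory operations, here is a minimal
  sketch using dicts in place of directoryTable and fileTable. The helper names
  (fso_key, resolve, rename_dir), the "objectId" field name, and the sample IDs
  are hypothetical. The sketch also assumes that top-level entries use the
  bucket's ID as their parentId.

```python
# Illustrative only: dicts stand in for directoryTable and fileTable.
def fso_key(vol_id, bucket_id, parent_id, name):
    """Build an FSO-style key: /{volId}/{buckId}/{parentId}/{name}."""
    return f"/{vol_id}/{bucket_id}/{parent_id}/{name}"


def resolve(directory_table, file_table, vol_id, bucket_id, path):
    """Walk a path one component at a time; return the file entry or None."""
    parent_id = bucket_id  # assumption: the bucket is the parent of top-level entries
    *dirs, leaf = path.split("/")
    for name in dirs:
        entry = directory_table.get(fso_key(vol_id, bucket_id, parent_id, name))
        if entry is None:
            return None
        parent_id = entry["objectId"]  # children are keyed by this directory's ID
    return file_table.get(fso_key(vol_id, bucket_id, parent_id, leaf))


def rename_dir(directory_table, vol_id, bucket_id, parent_id, old, new):
    """Rename a directory by rewriting only its own entry."""
    old_key = fso_key(vol_id, bucket_id, parent_id, old)
    directory_table[fso_key(vol_id, bucket_id, parent_id, new)] = directory_table.pop(old_key)


# Hypothetical sample data: volume ID 1, bucket ID 2, path a/b/f.txt.
directory_table = {
    fso_key(1, 2, 2, "a"): {"objectId": 10},
    fso_key(1, 2, 10, "b"): {"objectId": 11},
}
file_table = {fso_key(1, 2, 11, "f.txt"): "file-info"}

print(resolve(directory_table, file_table, 1, 2, "a/b/f.txt"))
```

  In this model, renaming "a" touches one directoryTable entry; the entries for
  "b" and "f.txt" are keyed by parent ID and stay unchanged.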


  4. Multipart Upload (MPU) Tables
  Stores metadata for S3-style multipart uploads.


  
  ┌─────────────────────┬─────────────────────────────────────┬─────────────────────┬─────────────────────────────────────────┐
  │ Table Name          │ Key Format                          │ Value Type          │ Description                             │
  ├─────────────────────┼─────────────────────────────────────┼─────────────────────┼─────────────────────────────────────────┤
  │ multipartInfoTable  │ /{volume}/{bucket}/{key}/{uploadId} │ OmMultipartKeyInfo  │ Overall MPU session metadata.           │
  │ multipartPartsTable │ {uploadId}/{partNumber}             │ OmMultipartPartInfo │ Metadata for individual uploaded parts. │
  └─────────────────────┴─────────────────────────────────────┴─────────────────────┴─────────────────────────────────────────┘
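
  The two MPU key formats can be sketched the same way. The helper names and
  sample upload IDs below are hypothetical, and a dict stands in for
  multipartPartsTable. The sketch orders parts by parsing the part number as an
  integer, since plain string comparison would place part 10 before part 2. How
  the OM itself orders parts is not stated here.

```python
# Illustrative only: a dict stands in for multipartPartsTable.
def multipart_info_key(volume, bucket, key, upload_id):
    """Build a multipartInfoTable-style key: /{volume}/{bucket}/{key}/{uploadId}."""
    return f"/{volume}/{bucket}/{key}/{upload_id}"


def multipart_part_key(upload_id, part_number):
    """Build a multipartPartsTable-style key: {uploadId}/{partNumber}."""
    return f"{upload_id}/{part_number}"


def parts_for_upload(parts_table, upload_id):
    """Collect (partNumber, value) pairs for one upload, ordered numerically."""
    prefix = f"{upload_id}/"
    found = [
        (int(k[len(prefix):]), v)
        for k, v in parts_table.items()
        if k.startswith(prefix)
    ]
    return sorted(found)


# Hypothetical sample data: two uploads with similar IDs.
parts_table = {
    multipart_part_key("u1", 10): "part-10",
    multipart_part_key("u1", 2): "part-2",
    multipart_part_key("u1", 1): "part-1",
    multipart_part_key("u10", 1): "other-upload",
}

print(parts_for_upload(parts_table, "u1"))
```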

  5. Snapshot Tables
  Supports Ozone Snapshots and associated garbage collection.


  
  ┌──────────────────────┬───────────────────────────────────┬────────────────────┬───────────────────────────────────────────────────────────┐
  │ Table Name           │ Key Format                        │ Value Type         │ Description                                               │
  ├──────────────────────┼───────────────────────────────────┼────────────────────┼───────────────────────────────────────────────────────────┤
  │ snapshotInfoTable    │ /{volume}/{bucket}/{snapshotName} │ SnapshotInfo       │ Metadata for a specific snapshot.                         │
  │ snapshotRenamedTable │ /{volName}/{buckName}/{objectId}  │ String             │ Tracks renamed objects between snapshots for correct GC.  │
  │ compactionLogTable   │ {dbTrxId}-{compactionTime}        │ CompactionLogEntry │ History of RocksDB compactions used by snapshot services. │
  └──────────────────────┴───────────────────────────────────┴────────────────────┴───────────────────────────────────────────────────────────┘


  6. Multi-Tenant and Security Tables
  Manages S3 secrets and multi-tenancy state.


  
  ┌───────────────────────────┬───────────────┬───────────────────────┬────────────────────────────────────────────────────┐
  │ Table Name                │ Key Format    │ Value Type            │ Description                                        │
  ├───────────────────────────┼───────────────┼───────────────────────┼────────────────────────────────────────────────────┤
  │ tenantStateTable          │ tenantId      │ OmDBTenantState       │ Tenant configuration and state.                    │
  │ tenantAccessIdTable       │ accessId      │ OmDBAccessIdInfo      │ Maps access ID to secret and tenant.               │
  │ principalToAccessIdsTable │ userPrincipal │ OmDBUserPrincipalInfo │ Maps a Kerberos principal to a list of access IDs. │
  │ s3SecretTable             │ accessKeyId   │ S3SecretValue         │ Stores S3 secrets for users.                       │
  │ dTokenTable               │ OzoneTokenID  │ Long                  │ Delegation tokens and their renewal times.         │
  └───────────────────────────┴───────────────┴───────────────────────┴────────────────────────────────────────────────────┘


  7. Administrative and System Tables


  
  ┌──────────────────────┬──────────────────┬─────────────────┬────────────────────────────────────────────────────────────────┐
  │ Table Name           │ Key Format       │ Value Type      │ Description                                                    │
  ├──────────────────────┼──────────────────┼─────────────────┼────────────────────────────────────────────────────────────────┤
  │ prefixTable          │ prefix           │ OmPrefixInfo    │ Prefix-level ACLs and metadata.                                │
  │ transactionInfoTable │ #TRANSACTIONINFO │ TransactionInfo │ Stores the last applied Ratis transaction ID and term.         │
  │ metaTable            │ metaDataKey      │ String          │ Miscellaneous system metadata (e.g., database layout version). │
  └──────────────────────┴──────────────────┴─────────────────┴────────────────────────────────────────────────────────────────┘

  Key Concepts

   - FSO vs. OBS: The primary difference is how paths are stored. OBS keys
     are built by concatenating the volume, bucket, and full key names, while
     FSO keys chain object IDs through the parentId component.
   - Object ID: A unique 64-bit identifier assigned to every object (volume,
     bucket, key, directory). It is used as the parentId in FSO tables.
   - OM Epoch: The most significant bits of Object IDs are often reserved for
     an epoch to ensure uniqueness across OM restarts or migrations.
   - Prefixes: Most keys in the hierarchy tables start with a leading slash (/)
     as defined by OzoneConsts.OM_KEY_PREFIX.
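
  The epoch idea in the list above can be sketched as simple bit packing. The
  field widths and the names below (EPOCH_BITS, make_object_id,
  split_object_id) are assumptions made for illustration. The actual bit layout
  used by the OM is not specified in this document and may differ.

```python
# Illustrative only: hypothetical split of a 64-bit ID into epoch and counter.
EPOCH_BITS = 2                      # assumed width of the epoch field
COUNTER_BITS = 64 - EPOCH_BITS      # remaining low-order bits
COUNTER_MASK = (1 << COUNTER_BITS) - 1


def make_object_id(epoch, counter):
    """Pack an epoch into the most significant bits above a counter."""
    if not 0 <= epoch < (1 << EPOCH_BITS):
        raise ValueError("epoch does not fit in the epoch field")
    if not 0 <= counter <= COUNTER_MASK:
        raise ValueError("counter does not fit in the counter field")
    return (epoch << COUNTER_BITS) | counter


def split_object_id(object_id):
    """Recover (epoch, counter) from a packed ID."""
    return object_id >> COUNTER_BITS, object_id & COUNTER_MASK


# The same counter value under two epochs yields two distinct IDs.
id_a = make_object_id(0, 5)
id_b = make_object_id(1, 5)
print(id_a != id_b, split_object_id(id_b))
```

  Python integers are unbounded, so the packed value is treated as unsigned
  here; a Java long holding the same bits would be negative when the top bit
  is set.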

 


