[I] [FEATURE] Support Sub-directory Bucketing for Metadata Table Partitions to Overcome File Count Limits [hudi]

via GitHub Wed, 11 Feb 2026 13:12:44 -0800


prashantwason opened a new issue, #18178:
URL: https://github.com/apache/hudi/issues/18178


   ### Summary
   
   Large-scale Hudi datasets with millions of records require many file groups 
(shards) in the Metadata Table (MDT), particularly for the Record Index 
partition. When all these file groups reside in a single directory, filesystems 
can hit per-directory file count limits. This proposal introduces a bucketing 
strategy that organizes MDT file groups into sub-directories (buckets), 
enabling Hudi to scale to larger datasets and migrate to cloud storage systems 
that do not support file append operations.
   
   ---
   
   ### Problem Statement
   
   Currently, large datasets can have many Record Index shards (e.g., ~15,000+) 
which consist of HFiles (base files) and log files. All of these are located in 
a single partition directory (e.g., `record_index/`), which can exceed the file 
count limit that a single directory can handle on certain filesystems.
   
   **Current structure:**
   ```
   .hoodie/metadata/
   └── record_index/
       ├── .hoodie_partition_metadata
       ├── record-index-0000_xxx.hfile
       ├── record-index-0000_xxx.log
       ├── record-index-0001_xxx.hfile
       ├── ...
       └── record-index-14999_xxx.hfile   # ~15,000+ files in one directory
   ```
   
   **Issues with the current approach:**
   1. **Directory file count limits**: Some filesystems impose limits on the 
number of files per directory
   2. **Cloud storage compatibility**: Log append operations are used to 
minimize file creation, but cloud object stores (S3, GCS, Azure Blob) do not 
support append, so each log write there creates a new file; the resulting file 
counts block cloud migration for large datasets
   3. **Filesystem performance degradation**: Directory listing and file 
operations become slower with many files in a single directory
   
   ---
   
   ### Proposed Solution
   
   Introduce an intermediate bucketing layer within MDT partitions. File groups 
are distributed across numbered sub-directories (buckets), with a configurable 
number of file groups per bucket.
   
   **New bucketed structure:**
   ```
   .hoodie/metadata/
   └── record_index/
       ├── 0000/
       │   ├── .hoodie_partition_metadata
       │   ├── record-index-0000_xxx.hfile
       │   ├── record-index-0001_xxx.hfile
       │   └── ...                          # Up to 1000 file groups
       ├── 0001/
       │   ├── .hoodie_partition_metadata
       │   ├── record-index-1000_xxx.hfile
       │   ├── record-index-1001_xxx.hfile
       │   └── ...
       └── 0014/
           └── ...
   ```
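   
   As a rough illustration of the layout above, the following sketch maps a file group index to its bucket directory. The class and method names are hypothetical and not taken from the Hudi codebase; the 4-digit zero padding and the use of integer division are assumptions read off the directory tree.
   
   ```java
   import java.util.Locale;

   /** Hypothetical helper; illustrative only, not part of Hudi. */
   final class BucketLayout {
       private BucketLayout() {}

       /** Bucket holding a file group: integer division of its index by the bucket size. */
       static int bucketIndex(int fileGroupIndex, int bucketSize) {
           if (fileGroupIndex < 0 || bucketSize <= 0) {
               throw new IllegalArgumentException("fileGroupIndex must be >= 0 and bucketSize > 0");
           }
           return fileGroupIndex / bucketSize;
       }

       /** Number of buckets needed for a given file group count (ceiling division). */
       static int bucketCount(int fileGroupCount, int bucketSize) {
           if (fileGroupCount < 0 || bucketSize <= 0) {
               throw new IllegalArgumentException("fileGroupCount must be >= 0 and bucketSize > 0");
           }
           return (fileGroupCount + bucketSize - 1) / bucketSize;
       }

       /** Relative path such as "record_index/0001"; the zero padding is an assumption. */
       static String bucketPath(String mdtPartition, int fileGroupIndex, int bucketSize) {
           return String.format(Locale.ROOT, "%s/%04d", mdtPartition,
                   bucketIndex(fileGroupIndex, bucketSize));
       }
   }
   ```
   
   With 15,000 file groups and a bucket size of 1000, `bucketCount` yields 15 buckets, which matches the `0000/` through `0014/` directories shown above.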
   
   ---
   
   ### Design Details
   
   #### Configuration
   
   | Config Key | Default | Description |
   |------------|---------|-------------|
   | `hoodie.metadata.file.group.bucketing.enable` | `false` | Enable bucketing for MDT partitions. Only applicable when MDT or a new partition is initialized. |
   | `hoodie.metadata.file.group.bucket.size` | `1000` | Number of file groups (shards) per bucket |
   
   A table-level property `hoodie.metadata.partitions.bucketing.enable` is 
persisted in `hoodie.properties` to track whether bucketing is enabled for the 
MDT.
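   
   A minimal sketch of setting the two proposed writer configs is shown below. The keys are the ones proposed in this issue; the wrapper class is hypothetical, and how the properties reach a Hudi writer is not specified here.
   
   ```java
   import java.util.Properties;

   /** Hypothetical example class; only the two config keys come from this proposal. */
   final class MdtBucketingConfigExample {
       private MdtBucketingConfigExample() {}

       /** Builds writer properties that opt in to MDT bucketing at initialization time. */
       static Properties writerProps() {
           Properties props = new Properties();
           // Proposed flag; defaults to false and only takes effect when the MDT
           // (or a new MDT partition) is initialized.
           props.setProperty("hoodie.metadata.file.group.bucketing.enable", "true");
           // Proposed bucket size; 1000 is the default from the table above.
           props.setProperty("hoodie.metadata.file.group.bucket.size", "1000");
           return props;
       }
   }
   ```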
   
   #### Backward Compatibility
   
   - **Reader-side auto-detection**: The reader code automatically detects 
whether bucketing is enabled by checking where `.hoodie_partition_metadata` is 
present: directly in the partition directory (non-bucketed) or inside the 
bucket sub-directories (bucketed)
   - **No reader-side config required**: Existing readers continue to work 
without configuration changes
   - **Write-once**: Bucketing mode is set at MDT initialization and cannot be 
changed after the MDT is created
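   
   The reader-side detection described above could look roughly like the following. This is a sketch under stated assumptions: it uses `java.nio.file` on a local filesystem for illustration, whereas Hudi would go through its own storage abstraction, and the class and method names are hypothetical.
   
   ```java
   import java.io.IOException;
   import java.io.UncheckedIOException;
   import java.nio.file.Files;
   import java.nio.file.Path;
   import java.util.stream.Stream;

   /** Hypothetical detector; illustrative only, not Hudi's reader code. */
   final class BucketingDetector {
       static final String PARTITION_METAFILE = ".hoodie_partition_metadata";

       private BucketingDetector() {}

       /**
        * Returns true when the marker file is absent at the partition level but
        * present in at least one immediate sub-directory (a bucket).
        */
       static boolean isBucketed(Path mdtPartitionDir) {
           if (Files.exists(mdtPartitionDir.resolve(PARTITION_METAFILE))) {
               return false; // flat (non-bucketed) layout
           }
           try (Stream<Path> children = Files.list(mdtPartitionDir)) {
               return children
                       .filter(Files::isDirectory)
                       .anyMatch(dir -> Files.exists(dir.resolve(PARTITION_METAFILE)));
           } catch (IOException e) {
               throw new UncheckedIOException(e);
           }
       }
   }
   ```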
   
   #### Key Implementation Changes
   
   1. **File Group Initialization** (`HoodieBackedTableMetadataWriter.java`):
      - When bucketing is enabled, file groups are created in bucket 
sub-directories
      - Bucket index is calculated as: `fileGroupIndex / bucketSize` (integer division)
   
   2. **Partition Path Resolution** (`HoodieTableMetadataUtil.java`):
      - `getPartitionLatestFileSlices()` iterates over bucket directories when 
bucketing is detected
      - File slices are collected from all buckets and sorted by file ID
   
   3. **Record Key to Partition Mapping**:
      - When writing records with bucketing enabled, the partition path in 
`HoodieKey` is updated to include the bucket path
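   
   The collect-and-sort step in item 2 can be sketched as follows. This is not the actual `getPartitionLatestFileSlices()` code: it works on plain file ID strings grouped by bucket name, and it assumes a file ID ends in a numeric file group index (as in `record-index-1000`), which is a simplification of the real file ID format. Sorting is done on the parsed index because a plain string sort would misorder IDs of different widths.
   
   ```java
   import java.util.ArrayList;
   import java.util.Comparator;
   import java.util.List;
   import java.util.Map;

   /** Hypothetical collector; illustrative only, not Hudi's implementation. */
   final class BucketedSliceCollector {
       private BucketedSliceCollector() {}

       /** Parses the trailing numeric index from an ID such as "record-index-1000" (assumed format). */
       static int fileGroupIndex(String fileId) {
           return Integer.parseInt(fileId.substring(fileId.lastIndexOf('-') + 1));
       }

       /** Flattens the per-bucket file IDs and orders them by file group index. */
       static List<String> collectSorted(Map<String, List<String>> fileIdsByBucket) {
           List<String> all = new ArrayList<>();
           for (List<String> ids : fileIdsByBucket.values()) {
               all.addAll(ids);
           }
           all.sort(Comparator.comparingInt(BucketedSliceCollector::fileGroupIndex));
           return all;
       }
   }
   ```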
   
   ---
   
   ### Testing Plan
   
   1. **Unit Tests**: All existing MDT tests pass with bucketing enabled
   2. **Integration Testing**: 
      - New tables with bucketing enabled from scratch
      - Read/write operations on bucketed MDT
   
   ---
   
   ### Risks and Considerations
   
   #### Immutability of Bucketing Setting
   Once the MDT is initialized, with or without bucketing, the setting cannot 
be changed via config. Changing it requires:
   - Deleting the MDT
   - Re-initializing with the desired bucketing setting
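   
   One way to picture this immutability is a resolver in which the persisted table property wins once the MDT exists. This is a hypothetical sketch, not the proposed implementation: the class, method, and `mdtExists` parameter are illustrative, and only the two property keys come from this issue.
   
   ```java
   import java.util.Properties;

   /** Hypothetical resolver; illustrative only. */
   final class BucketingSettingResolver {
       static final String TABLE_PROP = "hoodie.metadata.partitions.bucketing.enable";
       static final String WRITER_CONF = "hoodie.metadata.file.group.bucketing.enable";

       private BucketingSettingResolver() {}

       /**
        * Before the MDT exists, the writer config decides. Afterwards, the value
        * persisted in hoodie.properties is used and the writer config is ignored.
        */
       static boolean effectiveBucketing(Properties tableProps, Properties writerProps, boolean mdtExists) {
           if (!mdtExists) {
               return Boolean.parseBoolean(writerProps.getProperty(WRITER_CONF, "false"));
           }
           return Boolean.parseBoolean(tableProps.getProperty(TABLE_PROP, "false"));
       }
   }
   ```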
   
   ---
   
   ### Rollout Plan
   
   1. **Phase 1**: Feature flag disabled by default 
(`hoodie.metadata.file.group.bucketing.enable=false`)
   2. **Phase 2**: Enable for new tables hitting file count limits
   3. **Phase 3**: Consider making bucketing the default for new MDT 
initializations
   
   ---
   
   ### Related
   
   - **JIRA**: HUDI-5276
   - **Affects**: Record Index, Files partition, Column Stats, and other MDT 
partitions
   
   ---
   
   ### Acceptance Criteria
   
   - [ ] MDT file groups can be distributed across buckets when config is 
enabled
   - [ ] Reader code auto-detects bucketed vs non-bucketed format
   - [ ] All existing MDT unit tests pass
   - [ ] New unit tests for bucketing scenarios
   - [ ] Documentation for configuration and migration


