prashantwason opened a new issue, #18178:
URL: https://github.com/apache/hudi/issues/18178
### Summary
Large-scale Hudi datasets with millions of records require many file groups
(shards) in the Metadata Table (MDT), particularly for the Record Index
partition. When all these file groups reside in a single directory, filesystems
can hit per-directory file count limits. This proposal introduces a bucketing
strategy that organizes MDT file groups into sub-directories (buckets),
enabling Hudi to scale to larger datasets and migrate to cloud storage systems
that do not support file append operations.
---
### Problem Statement
Currently, large datasets can have many Record Index shards (e.g., ~15,000+)
which consist of HFiles (base files) and log files. All of these are located in
a single partition directory (e.g., `record_index/`), which can exceed the file
count limit that a single directory can handle on certain filesystems.
**Current structure:**
```
.hoodie/metadata/
└── record_index/
├── .hoodie_partition_metadata
├── record-index-0000_xxx.hfile
├── record-index-0000_xxx.log
├── record-index-0001_xxx.hfile
├── ...
└── record-index-14999_xxx.hfile # ~15,000+ files in one directory
```
**Issues with the current approach:**
1. **Directory file count limits**: Some filesystems impose limits on the
number of files per directory
2. **Cloud storage compatibility**: Log append operations are used to
minimize file creation, but append is not supported on cloud storage (S3, GCS,
Azure Blob), blocking cloud migration for large datasets
3. **Filesystem performance degradation**: Directory listing and file
operations become slower with many files in a single directory
---
### Proposed Solution
Introduce an intermediate bucketing layer within MDT partitions. File groups
are distributed across numbered sub-directories (buckets), with a configurable
number of file groups per bucket.
**New bucketed structure:**
```
.hoodie/metadata/
└── record_index/
├── 0000/
│ ├── .hoodie_partition_metadata
│ ├── record-index-0000_xxx.hfile
│ ├── record-index-0001_xxx.hfile
│ └── ... # Up to 1000 file groups
├── 0001/
│ ├── .hoodie_partition_metadata
│ ├── record-index-1000_xxx.hfile
│ ├── record-index-1001_xxx.hfile
│ └── ...
└── 0014/
└── ...
```
---
### Design Details
#### Configuration
| Config Key | Default | Description |
|------------|---------|-------------|
| `hoodie.metadata.file.group.bucketing.enable` | `false` | Enable bucketing
for MDT partitions. Only applicable when MDT or a new partition is initialized.
|
| `hoodie.metadata.file.group.bucket.size` | `1000` | Number of file groups
(shards) per bucket |
A table-level property `hoodie.metadata.partitions.bucketing.enable` is
persisted in `hoodie.properties` to track whether bucketing is enabled for the
MDT.
#### Backward Compatibility
- **Reader-side auto-detection**: The reader code automatically detects
whether bucketing is enabled by checking for the presence of
`.hoodie_partition_metadata` at the partition level vs bucket level
- **No reader-side config required**: Existing readers continue to work
without configuration changes
- **Write-once**: Bucketing mode is set at MDT initialization; cannot be
changed after MDT is created
#### Key Implementation Changes
1. **File Group Initialization** (`HoodieBackedTableMetadataWriter.java`):
- When bucketing is enabled, file groups are created in bucket
sub-directories
- Bucket index is calculated as: `fileGroupIndex / bucketSize`
2. **Partition Path Resolution** (`HoodieTableMetadataUtil.java`):
- `getPartitionLatestFileSlices()` iterates over bucket directories when
bucketing is detected
- File slices are collected from all buckets and sorted by file ID
3. **Record Key to Partition Mapping**:
- When writing records with bucketing enabled, the partition path in
`HoodieKey` is updated to include the bucket path
---
### Testing Plan
1. **Unit Tests**: All existing MDT tests pass with bucketing enabled
2. **Integration Testing**:
- New tables with bucketing enabled from scratch
- Read/write operations on bucketed MDT
---
### Risks and Considerations
#### Immutability of Bucketing Setting
Once MDT is initialized with or without bucketing, the setting cannot be
changed via config. Changing requires:
- Deleting the MDT
- Re-initializing with the desired bucketing setting
---
### Rollout Plan
1. **Phase 1**: Feature flag disabled by default
(`hoodie.metadata.file.group.bucketing.enable=false`)
2. **Phase 2**: Enable for new tables hitting file count limits
3. **Phase 3**: Consider making bucketing the default for new MDT
initializations
---
### Related
- **JIRA**: HUDI-5276
- **Affects**: Record Index, Files partition, Column Stats, and other MDT
partitions
---
### Acceptance Criteria
- [ ] MDT file groups can be distributed across buckets when config is
enabled
- [ ] Reader code auto-detects bucketed vs non-bucketed format
- [ ] All existing MDT unit tests pass
- [ ] New unit tests for bucketing scenarios
- [ ] Documentation for configuration and migration
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]