prashantwason opened a new pull request, #18182:
URL: https://github.com/apache/hudi/pull/18182
### Describe the issue this Pull Request addresses
Closes #18178
Large-scale Hudi datasets with millions of records require many file groups
(shards) in the Metadata Table (MDT), particularly for the Record Index
partition. When all these file groups reside in a single directory, filesystems
can hit per-directory file count limits. This PR introduces a bucketing
strategy that organizes MDT file groups into sub-directories (buckets),
enabling Hudi to scale to larger datasets.
### Summary and Changelog
File groups within a metadata table partition can now be organized into
sub-directories (buckets), each holding a configurable number of file groups,
so a partition no longer has to keep every file group in one directory.
**Changes:**
- Added new config `hoodie.metadata.file.group.bucketing.enable` (default:
false) to enable bucketing for MDT partitions
- Added new config `hoodie.metadata.file.group.bucket.size` (default: 1000)
to configure the number of file groups per bucket
- Added table property `hoodie.metadata.partitions.bucketing.enable` to
persist bucketing state
- Modified `HoodieBackedTableMetadataWriter.initializeFileGroups()` to
create file groups in bucket sub-directories when bucketing is enabled
- Modified `HoodieTableMetadataUtil.getPartitionFileSlices()` and
`getPartitionLatestFileSlicesIncludingInflight()` to iterate over bucket
directories when bucketing is detected
- Added helper methods: `getBucketRelativePath()`,
`isBucketingEnabledForMDT()`, `getPartitionsToList()`,
`setMetadataTablePartitionBucketing()`
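As a rough illustration of opting in, the two new writer configs could be set as plain properties. This is a minimal sketch: the class and method names (`MdtBucketingConfigSketch`, `bucketingProps`) are hypothetical, only the two config keys and the default bucket size come from this PR description, and how the properties reach a Hudi writer is not shown here. The table property `hoodie.metadata.partitions.bucketing.enable` is described above as persisted state, so the sketch does not set it.

```java
import java.util.Properties;

// Hypothetical helper, not part of Hudi: builds properties that opt in to MDT bucketing.
final class MdtBucketingConfigSketch {
    // Config keys as named in this PR description.
    static final String BUCKETING_ENABLE = "hoodie.metadata.file.group.bucketing.enable";
    static final String BUCKET_SIZE = "hoodie.metadata.file.group.bucket.size";

    private MdtBucketingConfigSketch() {}

    // Returns properties with bucketing enabled and the given number of file groups per bucket.
    static Properties bucketingProps(int bucketSize) {
        if (bucketSize <= 0) {
            throw new IllegalArgumentException("bucketSize must be positive");
        }
        Properties props = new Properties();
        props.setProperty(BUCKETING_ENABLE, "true");
        props.setProperty(BUCKET_SIZE, Integer.toString(bucketSize));
        return props;
    }

    public static void main(String[] args) {
        // 1000 is the default bucket size stated in the PR description.
        Properties props = bucketingProps(1000);
        props.forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```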
**New bucketed structure:**
```
.hoodie/metadata/
└── record_index/
    ├── 0000/
    │   ├── .hoodie_partition_metadata
    │   ├── record-index-0000_xxx.hfile
    │   └── ...                          # Up to 1000 file groups
    ├── 0001/
    │   ├── record-index-1000_xxx.hfile
    │   └── ...
    └── ...
```
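The layout above suggests that a file group's bucket is its index divided by the bucket size, rendered as a zero-padded directory name. The sketch below illustrates that mapping under those assumptions; it is not the actual `getBucketRelativePath()` implementation, whose signature is not shown in this description. The class name, method name, and four-digit padding are inferred from the example tree only.

```java
import java.util.Locale;

// Hypothetical illustration of the index-to-bucket mapping implied by the tree above.
final class BucketPathSketch {
    private BucketPathSketch() {}

    // Maps a zero-based file group index to a bucket directory name such as "0001".
    // Assumes integer division by the bucket size and a minimum width of four digits.
    static String bucketRelativePath(int fileGroupIndex, int bucketSize) {
        if (fileGroupIndex < 0) {
            throw new IllegalArgumentException("fileGroupIndex must be non-negative");
        }
        if (bucketSize <= 0) {
            throw new IllegalArgumentException("bucketSize must be positive");
        }
        int bucket = fileGroupIndex / bucketSize;
        // Locale.ROOT keeps the digits ASCII regardless of the JVM's default locale.
        return String.format(Locale.ROOT, "%04d", bucket);
    }

    public static void main(String[] args) {
        System.out.println(bucketRelativePath(0, 1000));    // first file group of bucket 0000
        System.out.println(bucketRelativePath(999, 1000));  // last file group of bucket 0000
        System.out.println(bucketRelativePath(1000, 1000)); // first file group of bucket 0001
    }
}
```

With a bucket size of 1000, file groups 0 through 999 land in `0000` and file group 1000 starts `0001`, which matches `record-index-1000_xxx.hfile` appearing under `0001/` in the tree.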
### Impact
- **New configs added**: `hoodie.metadata.file.group.bucketing.enable`,
`hoodie.metadata.file.group.bucket.size`
- **New table property**: `hoodie.metadata.partitions.bucketing.enable`
- **Backward compatible**: Reader code auto-detects bucketed vs non-bucketed
format without requiring configuration
- **No breaking changes**: Feature is disabled by default
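To make the reader-side behaviour concrete, here is a sketch of how the set of directories to list might differ between the two layouts. It is an assumption-laden illustration, not the real `getPartitionsToList()`: the class, method, and parameters are hypothetical, and because this description does not say how detection works, the bucketed/non-bucketed decision is passed in as a flag.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration of choosing which directories a reader lists.
final class PartitionsToListSketch {
    private PartitionsToListSketch() {}

    // Non-bucketed: list the partition directory itself.
    // Bucketed: list one sub-directory per bucket, assuming four-digit bucket names.
    static List<String> partitionsToList(String partition, boolean bucketed,
                                         int fileGroupCount, int bucketSize) {
        List<String> paths = new ArrayList<>();
        if (!bucketed) {
            paths.add(partition);
            return paths;
        }
        if (fileGroupCount < 0 || bucketSize <= 0) {
            throw new IllegalArgumentException("fileGroupCount must be >= 0 and bucketSize > 0");
        }
        // Ceiling division, done in long arithmetic to avoid int overflow.
        int bucketCount = (int) (((long) fileGroupCount + bucketSize - 1) / bucketSize);
        for (int b = 0; b < bucketCount; b++) {
            paths.add(partition + "/" + String.format(Locale.ROOT, "%04d", b));
        }
        return paths;
    }

    public static void main(String[] args) {
        // 2500 file groups at 1000 per bucket need three bucket directories.
        partitionsToList("record_index", true, 2500, 1000).forEach(System.out::println);
    }
}
```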
### Risk Level
Low - Feature is disabled by default and only affects new MDT
initializations when explicitly enabled. Reader code is backward compatible and
auto-detects the format.
### Documentation Update
The config descriptions are included in the code. Website documentation
should be updated to describe:
- The new bucketing feature and when to use it (large datasets with many
record index shards)
- The two new configuration options
### Contributor's checklist
- [x] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [x] Enough context is provided in the sections above
- [ ] Adequate tests were added if applicable