prashantwason opened a new pull request, #18182:
URL: https://github.com/apache/hudi/pull/18182

   ### Describe the issue this Pull Request addresses
   
   Closes #18178
   
   Large-scale Hudi datasets with millions of records require many file groups 
(shards) in the Metadata Table (MDT), particularly for the Record Index 
partition. When all these file groups reside in a single directory, filesystems 
can hit per-directory file count limits. This PR introduces a bucketing 
strategy that organizes MDT file groups into sub-directories (buckets), 
enabling Hudi to scale to larger datasets.
   
   ### Summary and Changelog
   
   This PR adds an opt-in layout in which the file groups of a metadata table 
partition are placed in bucket sub-directories instead of directly under the 
partition directory. Existing non-bucketed tables keep their current layout.
   
   **Changes:**
   - Added new config `hoodie.metadata.file.group.bucketing.enable` (default: 
false) to enable bucketing for MDT partitions
   - Added new config `hoodie.metadata.file.group.bucket.size` (default: 1000) 
to set the maximum number of file groups per bucket
   - Added table property `hoodie.metadata.partitions.bucketing.enable` to 
persist bucketing state
   - Modified `HoodieBackedTableMetadataWriter.initializeFileGroups()` to 
create file groups in bucket sub-directories when bucketing is enabled
   - Modified `HoodieTableMetadataUtil.getPartitionFileSlices()` and 
`getPartitionLatestFileSlicesIncludingInflight()` to iterate over bucket 
directories when bucketing is detected
   - Added helper methods: `getBucketRelativePath()`, 
`isBucketingEnabledForMDT()`, `getPartitionsToList()`, 
`setMetadataTablePartitionBucketing()`
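   
   As a rough illustration of the two new configs, the sketch below builds a 
plain `java.util.Properties` object with the keys listed above. The class and 
method names are hypothetical, and how the properties reach the writer depends 
on the engine and client in use, which this description does not cover.
   
   ```java
   import java.util.Properties;

   // Hypothetical helper, not part of this PR: collects the two new config keys.
   final class BucketingConfigSketch {
       static Properties bucketingProps(int fileGroupsPerBucket) {
           Properties props = new Properties();
           // Opt in to bucketing (the PR states the default is false).
           props.setProperty("hoodie.metadata.file.group.bucketing.enable", "true");
           // Number of file groups per bucket (the PR states the default is 1000).
           props.setProperty("hoodie.metadata.file.group.bucket.size",
               String.valueOf(fileGroupsPerBucket));
           return props;
       }
   }
   ```
   
   Because the risk section says the feature only affects new MDT 
initializations, these values would presumably need to be set before the 
partition is first initialized.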
   
   **New bucketed structure:**
   ```
   .hoodie/metadata/
   └── record_index/
       ├── 0000/
       │   ├── .hoodie_partition_metadata
       │   ├── record-index-0000_xxx.hfile
       │   └── ...                          # Up to 1000 file groups
       ├── 0001/
       │   ├── record-index-1000_xxx.hfile
       │   └── ...
       └── ...
   ```
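   
   The layout above suggests a simple mapping from a file group index to its 
bucket directory: integer division by the bucket size, zero-padded to four 
digits. The following is a minimal sketch under that assumption. The class 
name, method name, and signature are illustrative and may differ from the 
PR's actual `getBucketRelativePath()`.
   
   ```java
   import java.util.Locale;

   // Illustrative only: mapping inferred from the directory tree in this description.
   final class BucketPathSketch {
       static String bucketRelativePath(int fileGroupIndex, int bucketSize) {
           if (fileGroupIndex < 0 || bucketSize <= 0) {
               throw new IllegalArgumentException("index must be >= 0 and bucket size > 0");
           }
           // File groups 0..bucketSize-1 go to bucket 0, the next bucketSize to bucket 1, etc.
           int bucket = fileGroupIndex / bucketSize;
           // Zero-pad to four digits to match names such as 0000 and 0001.
           return String.format(Locale.ROOT, "%04d", bucket);
       }
   }
   ```
   
   With a bucket size of 1000, file group 0 maps to `0000` and file group 1000 
maps to `0001`, which matches the tree shown above.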
   
   ### Impact
   
   - **New configs added**: `hoodie.metadata.file.group.bucketing.enable`, 
`hoodie.metadata.file.group.bucket.size`
   - **New table property**: `hoodie.metadata.partitions.bucketing.enable`
   - **Backward compatible**: Reader code auto-detects bucketed vs non-bucketed 
format without requiring configuration
   - **No breaking changes**: Feature is disabled by default
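   
   One way a reader could tell the two layouts apart is sketched below. This 
description does not state how detection actually works (it may rely on the 
persisted table property rather than on directory names), so the all-digits 
naming rule and every identifier here are assumptions made for illustration.
   
   ```java
   import java.util.ArrayList;
   import java.util.Collections;
   import java.util.List;

   // Hypothetical sketch: decide which paths to list for one MDT partition.
   final class BucketDetectionSketch {
       // Assumed rule: a child directory with an all-digit name is a bucket.
       static boolean looksLikeBucket(String dirName) {
           return !dirName.isEmpty() && dirName.chars().allMatch(Character::isDigit);
       }

       // childDirNames: names of the sub-directories found directly under the partition.
       static List<String> pathsToList(String partition, List<String> childDirNames) {
           List<String> buckets = new ArrayList<>();
           for (String name : childDirNames) {
               if (looksLikeBucket(name)) {
                   buckets.add(partition + "/" + name);
               }
           }
           // No bucket directories found: fall back to the flat, non-bucketed layout.
           return buckets.isEmpty() ? Collections.singletonList(partition) : buckets;
       }
   }
   ```
   
   Keeping the function free of filesystem calls makes the fallback to the 
flat layout easy to test in isolation.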
   
   ### Risk Level
   
   Low - Feature is disabled by default and only affects new MDT 
initializations when explicitly enabled. Reader code is backward compatible and 
auto-detects the format.
   
   ### Documentation Update
   
   The config descriptions are included in the code. Website documentation 
should be updated to describe:
   - The new bucketing feature and when to use it (large datasets with many 
record index shards)
   - The two new configuration options
   
   ### Contributor's checklist
   
   - [x] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [x] Enough context is provided in the sections above
   - [ ] Adequate tests were added if applicable

