rahil-c opened a new issue, #18852:
URL: https://github.com/apache/hudi/issues/18852

Part of #18676. RFC-104 / [design PR](https://github.com/chrevanthreddy/hudi/pull/1).

## Scope

Records belonging to the same cluster must land in the same contiguous bucket of MDT file groups (cluster = a folder containing N files). This sub-task adds the mapping function used by the MDT writer.

## Tasks

- Add `getVectorKeyToFileGroupMappingFunction(numClusters, fgPerCluster)` in `hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java`.
- Key encoding: prefix the record key with the cluster ID, e.g. `C<hex(clusterId)>|<recordKey>`. Allows prefix scans per cluster at read time.
- Mapping: `fileGroupIndex = (clusterId * fgPerCluster) + (hash(recordKey) % fgPerCluster)`.
- Override `getFileGroupMappingFunction(HoodieIndexVersion)` on the `VECTOR_INDEX` enum in `MetadataPartitionType` so MDT routes records to the right file group.

## Tests

- Unit test: insert many synthetic `(recordKey, clusterId)` tuples; assert all records for cluster `c` land in file groups `[c*fgPerCluster, (c+1)*fgPerCluster)`.
- Unit test: varying `fgPerCluster` (1, 4, 16); distribution of records within a cluster is roughly uniform across that cluster's file groups.

## Depends on

- Sub-issue 1 (partition type registration)

## Out of scope

Actual writing into the file groups; that happens in sub-issue 5.
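
The key encoding and mapping formula listed under Tasks could be sketched roughly as below. This is a minimal illustration under stated assumptions, not the proposed Hudi implementation: the class `VectorKeyMappingSketch`, the helper `encodeKey`, the `java.util.function.Function` return type, and the use of `String.hashCode()` as the hash are all hypothetical stand-ins. `Math.floorMod` is used instead of `%` because a Java hash can be negative, which would otherwise produce an index outside the cluster's bucket.

```java
import java.util.function.Function;

/** Hypothetical sketch of the cluster-aware mapping; names are illustrative only. */
final class VectorKeyMappingSketch {

  private VectorKeyMappingSketch() {}

  /** Builds a key of the form C<hex(clusterId)>|<recordKey>, as described in the issue. */
  static String encodeKey(int clusterId, String recordKey) {
    if (clusterId < 0) {
      throw new IllegalArgumentException("clusterId must be non-negative: " + clusterId);
    }
    return "C" + Integer.toHexString(clusterId) + "|" + recordKey;
  }

  /**
   * Returns a function from an encoded key to a file group index using
   * fileGroupIndex = (clusterId * fgPerCluster) + (hash(recordKey) mod fgPerCluster).
   * Assumption: the function receives the encoded key and parses the cluster ID from it.
   */
  static Function<String, Integer> mappingFunction(int numClusters, int fgPerCluster) {
    if (numClusters <= 0 || fgPerCluster <= 0) {
      throw new IllegalArgumentException("numClusters and fgPerCluster must be positive");
    }
    return encodedKey -> {
      // Hex digits never contain '|', so the first '|' is always the delimiter.
      int sep = encodedKey.indexOf('|');
      if (sep < 2 || encodedKey.charAt(0) != 'C') {
        throw new IllegalArgumentException("Malformed vector key: " + encodedKey);
      }
      int clusterId = Integer.parseInt(encodedKey.substring(1, sep), 16);
      if (clusterId >= numClusters) {
        throw new IllegalArgumentException("clusterId out of range: " + clusterId);
      }
      String recordKey = encodedKey.substring(sep + 1);
      // floorMod keeps the offset in [0, fgPerCluster) even for negative hash codes.
      int offset = Math.floorMod(recordKey.hashCode(), fgPerCluster);
      return clusterId * fgPerCluster + offset;
    };
  }
}
```

With this sketch, `encodeKey(26, "r1")` produces `C1a|r1`, and `mappingFunction(32, 4)` places every key of cluster 26 in file groups `[104, 108)`. Because the hex prefix is not zero-padded here, a per-cluster prefix scan would need to include the `|` delimiter (for example `C1|` rather than `C1`) to avoid also matching cluster `C1a`.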
