rahil-c opened a new issue, #18852:
URL: https://github.com/apache/hudi/issues/18852

   Part of #18676. RFC-104 / [design PR](https://github.com/chrevanthreddy/hudi/pull/1).
   
   ## Scope
   
   Records belonging to the same cluster must land in the same contiguous 
range of metadata table (MDT) file groups (a cluster is a folder containing N 
files). This sub-task adds the mapping function that the MDT writer uses to 
route records.
   
   ## Tasks
   
   - Add `getVectorKeyToFileGroupMappingFunction(numClusters, fgPerCluster)` in 
`hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java`.
   - Key encoding: prefix the record key with the cluster ID, e.g. 
`C<hex(clusterId)>|<recordKey>`. This allows prefix scans per cluster at read time.
   - Mapping: `fileGroupIndex = (clusterId * fgPerCluster) + (hash(recordKey) % 
fgPerCluster)`.
   - Override `getFileGroupMappingFunction(HoodieIndexVersion)` on the 
`VECTOR_INDEX` enum in `MetadataPartitionType` so MDT routes records to the 
right file group.
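
The encoding and mapping in the tasks above can be sketched roughly as follows. This is a minimal illustration, not the Hudi implementation: the class and method names (`VectorKeyMappingSketch`, `encodeKey`, `clusterPrefix`, `fileGroupIndex`) are hypothetical, and `String.hashCode()` is an assumed stand-in for whatever hash the real function uses. `Math.floorMod` is used because Java's `%` can return a negative value for a negative hash.

```java
/** Illustrative sketch only; names are hypothetical and not Hudi APIs. */
final class VectorKeyMappingSketch {

  private VectorKeyMappingSketch() {}

  /** Builds the encoded key C<hex(clusterId)>|<recordKey>. */
  static String encodeKey(int clusterId, String recordKey) {
    if (clusterId < 0) {
      throw new IllegalArgumentException("clusterId must be non-negative: " + clusterId);
    }
    return "C" + Integer.toHexString(clusterId) + "|" + recordKey;
  }

  /** Prefix shared by all keys of one cluster; the trailing '|' keeps C1 from matching C10. */
  static String clusterPrefix(int clusterId) {
    return encodeKey(clusterId, "");
  }

  /** Maps an encoded key to a file group index in [0, numClusters * fgPerCluster). */
  static int fileGroupIndex(String encodedKey, int numClusters, int fgPerCluster) {
    int sep = encodedKey.indexOf('|');
    if (sep < 2 || encodedKey.charAt(0) != 'C') {
      throw new IllegalArgumentException("Not a cluster-prefixed key: " + encodedKey);
    }
    int clusterId = Integer.parseInt(encodedKey.substring(1, sep), 16);
    if (clusterId < 0 || clusterId >= numClusters) {
      throw new IllegalArgumentException("clusterId out of range: " + clusterId);
    }
    // Assumption: String.hashCode() stands in for the real hash function.
    String recordKey = encodedKey.substring(sep + 1);
    int slot = Math.floorMod(recordKey.hashCode(), fgPerCluster);
    return clusterId * fgPerCluster + slot;
  }
}
```

Because the hex cluster ID is variable-width in this sketch, a per-cluster prefix scan should include the `|` delimiter; a fixed-width hex encoding would be an alternative.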
   
   ## Tests
   
   - Unit test: insert many synthetic `(recordKey, clusterId)` tuples; assert 
all records for cluster `c` land in file groups `[c*fgPerCluster, 
(c+1)*fgPerCluster)`.
   - Unit test: vary `fgPerCluster` (1, 4, 16) and assert that records 
within a cluster are distributed roughly uniformly across that cluster's file groups.
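
The two checks above could look something like the following sketch. It is plain Java rather than a JUnit test in the Hudi test tree, the names (`VectorKeyMappingCheck`, `countsForCluster`, `roughlyUniform`) are hypothetical, and the mapping is re-declared locally with `String.hashCode()` as an assumed stand-in hash so the block is self-contained. The uniformity tolerance is an arbitrary illustrative choice.

```java
import java.util.Random;

/** Illustrative check only; names are hypothetical and not part of Hudi. */
final class VectorKeyMappingCheck {

  private VectorKeyMappingCheck() {}

  // Assumed stand-in for the real mapping: clusterId * fgPerCluster + hash % fgPerCluster.
  static int fileGroupIndex(int clusterId, String recordKey, int fgPerCluster) {
    return clusterId * fgPerCluster + Math.floorMod(recordKey.hashCode(), fgPerCluster);
  }

  /**
   * Routes numRecords synthetic keys for one cluster and returns the count per file group.
   * Throws if any record lands outside [c * fgPerCluster, (c + 1) * fgPerCluster).
   */
  static int[] countsForCluster(int clusterId, int fgPerCluster, int numRecords, long seed) {
    Random rnd = new Random(seed);
    int lo = clusterId * fgPerCluster;
    int hi = lo + fgPerCluster;
    int[] counts = new int[fgPerCluster];
    for (int i = 0; i < numRecords; i++) {
      String recordKey = "key-" + rnd.nextLong();
      int fg = fileGroupIndex(clusterId, recordKey, fgPerCluster);
      if (fg < lo || fg >= hi) {
        throw new AssertionError("file group " + fg + " outside [" + lo + ", " + hi + ")");
      }
      counts[fg - lo]++;
    }
    return counts;
  }

  /** True if every count is within tolerance (a fraction) of the mean count. */
  static boolean roughlyUniform(int[] counts, double tolerance) {
    long total = 0;
    for (int c : counts) {
      total += c;
    }
    double expected = (double) total / counts.length;
    for (int c : counts) {
      if (Math.abs(c - expected) > tolerance * expected) {
        return false;
      }
    }
    return true;
  }
}
```

A caller would loop over `fgPerCluster` values 1, 4, and 16, call `countsForCluster` for a few cluster IDs, and assert `roughlyUniform` on each result.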
   
   ## Depends on
   
   - Sub-issue 1 (partition type registration)
   
   ## Out of scope
   
   Actually writing records into the file groups; that happens in sub-issue 5.
   

