[I] [Vector Index] Cluster assignment and bootstrap orchestration for MDT writes [hudi]

via GitHub Tue, 26 May 2026 11:11:56 -0700


rahil-c opened a new issue, #18854:
URL: https://github.com/apache/hudi/issues/18854


   Part of #18676. RFC-104 / [design PR](https://github.com/chrevanthreddy/hudi/pull/1).
   
   ## Scope
   
   This is the glue layer that produces and commits the records to the new metadata table (MDT) partition, and it is the meat of the milestone. It brings together sub-issues 1–4.
   
   ## Tasks
   
   - In `hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/metadata/SparkHoodieBackedTableMetadataWriter.java`, implement:
   
     ```java
     Pair<Integer, HoodieData<HoodieRecord>> initializeVectorIndexPartition(
         HoodieIndexDefinition indexDef,
         Lazy<List<Pair<String, FileSlice>>> latestFileSlices)
     ```
   
     Steps:
     1. Read base files for `(recordKey, vectorColumn)` via Spark.
     2. Call `SparkVectorIndexTrainer` (sub-issue 4) → centroids.
     3. Broadcast centroids; map each row to its nearest centroid → `(recordKey, vector, clusterId)`.
     4. Build MDT records via `HoodieMetadataPayload.createVectorIndexRecord` (sub-issue 2).
     5. Return `(numClusters * fgPerCluster, HoodieData<HoodieRecord>)`.
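
     Step 3 and the count in step 5 can be sketched in plain Java as follows. This is a minimal sketch, not the Spark implementation: `NearestCentroidAssigner` and its methods are hypothetical names, squared Euclidean distance is an assumed metric (the issue does not name one), and the centroids are passed as a plain array where the real code would read a Spark broadcast.

     ```java
     import java.util.ArrayList;
     import java.util.List;

     /** Hypothetical helper (not a Hudi class): plain-Java version of steps 3 and 5. */
     final class NearestCentroidAssigner {

       /** One output row of step 3: (recordKey, vector, clusterId). */
       static final class AssignedRow {
         final String recordKey;
         final float[] vector;
         final int clusterId;

         AssignedRow(String recordKey, float[] vector, int clusterId) {
           this.recordKey = recordKey;
           this.vector = vector;
           this.clusterId = clusterId;
         }
       }

       /** Squared Euclidean distance; the metric is an assumption of this sketch. */
       static double squaredL2(float[] a, float[] b) {
         double sum = 0.0;
         for (int i = 0; i < a.length; i++) {
           double diff = a[i] - b[i];
           sum += diff * diff;
         }
         return sum;
       }

       /** Returns the index of the closest centroid; ties go to the lowest index. */
       static int nearestCentroid(float[] vector, float[][] centroids) {
         int best = 0;
         double bestDist = Double.POSITIVE_INFINITY;
         for (int c = 0; c < centroids.length; c++) {
           double d = squaredL2(vector, centroids[c]);
           if (d < bestDist) {
             bestDist = d;
             best = c;
           }
         }
         return best;
       }

       /** Step 3: tags every (recordKey, vector) pair with its nearest cluster id. */
       static List<AssignedRow> assign(List<String> recordKeys, List<float[]> vectors,
                                       float[][] centroids) {
         List<AssignedRow> rows = new ArrayList<>(recordKeys.size());
         for (int i = 0; i < recordKeys.size(); i++) {
           float[] v = vectors.get(i);
           rows.add(new AssignedRow(recordKeys.get(i), v, nearestCentroid(v, centroids)));
         }
         return rows;
       }

       /** Step 5: the file group count returned alongside the records. */
       static int fileGroupCount(int numClusters, int fgPerCluster) {
         return Math.multiplyExact(numClusters, fgPerCluster);
       }
     }
     ```

     In the Spark writer, the same per-row function would presumably run inside a map over the rows read in step 1, with the centroid array taken from the broadcast value.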
   
   - In `hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java`:
     - Add `initializeVectorIndexPartition(...)` abstract hook.
     - Add `VECTOR_INDEX` switch case in `initializeFromFilesystem` (lines ~475–524) calling the abstract hook then `initializeFilegroupsAndCommit`.
   
   - Stub the Flink / Java client implementations with `UnsupportedOperationException` (the RFC is explicitly Spark-first for now).
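
   The hook, dispatch, and stub pattern from the last two tasks might look roughly like the sketch below. It uses simplified stand-ins rather than the real Hudi classes: `PartitionTypeSketch`, `MetadataWriterSketch`, `InitResult`, `SparkWriterSketch`, `FlinkWriterSketch`, and `commit` are all hypothetical, and the hook takes a plain `String` where the real signature takes a `HoodieIndexDefinition` and the lazy file slice list.

   ```java
   import java.util.Arrays;
   import java.util.List;

   /** Simplified stand-in for the MDT partition types; only VECTOR_INDEX is named in the issue. */
   enum PartitionTypeSketch { FILES, VECTOR_INDEX }

   /** Stand-in for the common writer: declares the hook and dispatches to it. */
   abstract class MetadataWriterSketch {

     /** Stand-in for Pair<Integer, HoodieData<HoodieRecord>>. */
     static final class InitResult {
       final int fileGroupCount;
       final List<String> records;

       InitResult(int fileGroupCount, List<String> records) {
         this.fileGroupCount = fileGroupCount;
         this.records = records;
       }
     }

     /** Engine-specific hook; each engine's writer must implement it. */
     protected abstract InitResult initializeVectorIndexPartition(String indexName);

     /** Mirrors the new switch case: call the hook, then hand the result to the commit step. */
     String initializePartition(PartitionTypeSketch type, String indexName) {
       switch (type) {
         case VECTOR_INDEX: {
           InitResult result = initializeVectorIndexPartition(indexName);
           return commit(type, result);
         }
         default:
           throw new IllegalArgumentException("Not covered by this sketch: " + type);
       }
     }

     /** Placeholder for initializeFilegroupsAndCommit; it only summarizes what it was given. */
     private String commit(PartitionTypeSketch type, InitResult result) {
       return type + ": " + result.fileGroupCount + " file groups, "
           + result.records.size() + " records";
     }
   }

   /** Stand-in for the Spark writer, returning canned data instead of running Spark. */
   final class SparkWriterSketch extends MetadataWriterSketch {
     @Override
     protected InitResult initializeVectorIndexPartition(String indexName) {
       return new InitResult(8, Arrays.asList("r1", "r2", "r3"));
     }
   }

   /** Stand-in for the Flink / Java writers: stubbed, as the task describes. */
   final class FlinkWriterSketch extends MetadataWriterSketch {
     @Override
     protected InitResult initializeVectorIndexPartition(String indexName) {
       throw new UnsupportedOperationException("Vector index bootstrap is Spark-only for now");
     }
   }
   ```

   Keeping the hook abstract forces every engine to make an explicit choice, so an unsupported engine fails with a clear exception rather than silently skipping the partition.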
   
   ## Tests
   
   - Unit test with a mocked `SparkVectorIndexTrainer`, confirming that the produced `HoodieData<HoodieRecord>` has the expected count and cluster distribution.
   - End-to-end coverage is in sub-issue 7.
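
   The shape of the unit test's core check can be sketched without Spark or a mocking library. Everything here is illustrative: `TrainerSketch` is a hypothetical one-method stand-in for whatever contract `SparkVectorIndexTrainer` ends up exposing in sub-issue 4, `ClusterDistributionCheck` is a made-up helper, and squared Euclidean distance is an assumed metric.

   ```java
   import java.util.HashMap;
   import java.util.Map;

   /** Hypothetical stand-in for the trainer contract; the real API belongs to sub-issue 4. */
   interface TrainerSketch {
     float[][] train(float[][] vectors);
   }

   /** Hypothetical helper: counts how many rows land in each cluster. */
   final class ClusterDistributionCheck {

     static Map<Integer, Integer> distribution(float[][] vectors, TrainerSketch trainer) {
       // A stubbed trainer returns fixed centroids, which makes the result deterministic.
       float[][] centroids = trainer.train(vectors);
       Map<Integer, Integer> counts = new HashMap<>();
       for (float[] v : vectors) {
         int best = 0;
         double bestDist = Double.POSITIVE_INFINITY;
         for (int c = 0; c < centroids.length; c++) {
           double d = 0.0;
           for (int i = 0; i < v.length; i++) {
             double diff = v[i] - centroids[c][i];
             d += diff * diff;
           }
           if (d < bestDist) {
             bestDist = d;
             best = c;
           }
         }
         counts.merge(best, 1, Integer::sum);
       }
       return counts;
     }
   }
   ```

   A test would pass a stub such as `v -> new float[][] {{0f, 0f}, {10f, 10f}}` as the trainer, then assert that the per-cluster counts sum to the number of input rows and match the expected split.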
   
   ## Depends on
   
   - Sub-issues 1, 2, 3, 4
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

