rahil-c opened a new issue, #18854: URL: https://github.com/apache/hudi/issues/18854
Part of #18676. RFC-104 / [design PR](https://github.com/chrevanthreddy/hudi/pull/1).

## Scope

The glue layer that produces and commits the records to the new MDT partition. This is the meat of the milestone. Brings together sub-issues 1–4.

## Tasks

- In `hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/metadata/SparkHoodieBackedTableMetadataWriter.java`, implement:

  ```java
  Pair<Integer, HoodieData<HoodieRecord>> initializeVectorIndexPartition(
      HoodieIndexDefinition indexDef,
      Lazy<List<Pair<String, FileSlice>>> latestFileSlices)
  ```

  Steps:
  1. Read base files for `(recordKey, vectorColumn)` via Spark.
  2. Call `SparkVectorIndexTrainer` (sub-issue 4) → centroids.
  3. Broadcast centroids; map each row to its nearest centroid → `(recordKey, vector, clusterId)`.
  4. Build MDT records via `HoodieMetadataPayload.createVectorIndexRecord` (sub-issue 2).
  5. Return `(numClusters * fgPerCluster, HoodieData<HoodieRecord>)`.
- In `hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java`:
  - Add `initializeVectorIndexPartition(...)` abstract hook.
  - Add `VECTOR_INDEX` switch case in `initializeFromFilesystem` (lines ~475–524) calling the abstract hook then `initializeFilegroupsAndCommit`.
- Stub the Flink / Java client implementations with `UnsupportedOperationException` (RFC is explicitly Spark-first for now).

## Tests

- Unit test with mocked `SparkVectorIndexTrainer` confirming the produced `HoodieData<HoodieRecord>` has the expected count and cluster distribution.
- End-to-end coverage is in sub-issue 7.

## Depends on

- Sub-issues 1, 2, 3, 4
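
For illustration only, step 3 of the task list (mapping each row to its nearest centroid) might look roughly like the Spark-free sketch below. It is not the RFC's implementation: the class `NearestCentroidSketch`, the `Assignment` type, and the choice of squared L2 distance are all assumptions made here, since the issue does not name the distance metric. In the real writer this logic would presumably run inside a Spark map over the broadcast centroids.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, Spark-free sketch of nearest-centroid assignment (step 3). */
final class NearestCentroidSketch {

  /** Hypothetical stand-in for the (recordKey, vector, clusterId) tuple. */
  static final class Assignment {
    final String recordKey;
    final float[] vector;
    final int clusterId;

    Assignment(String recordKey, float[] vector, int clusterId) {
      this.recordKey = recordKey;
      this.vector = vector;
      this.clusterId = clusterId;
    }
  }

  /** Squared L2 distance; an assumed metric, the issue does not specify one. */
  static double squaredL2(float[] a, float[] b) {
    if (a.length != b.length) {
      throw new IllegalArgumentException("dimension mismatch");
    }
    double sum = 0.0;
    for (int i = 0; i < a.length; i++) {
      double d = (double) a[i] - (double) b[i];
      sum += d * d;
    }
    return sum;
  }

  /** Returns the index of the closest centroid; ties go to the lowest index. */
  static int nearestCentroid(float[] vector, float[][] centroids) {
    if (centroids.length == 0) {
      throw new IllegalArgumentException("no centroids");
    }
    int best = 0;
    double bestDist = squaredL2(vector, centroids[0]);
    for (int c = 1; c < centroids.length; c++) {
      double dist = squaredL2(vector, centroids[c]);
      if (dist < bestDist) {
        bestDist = dist;
        best = c;
      }
    }
    return best;
  }

  /** Assigns every (recordKey, vector) pair to a cluster ID. */
  static List<Assignment> assign(List<String> keys, List<float[]> vectors, float[][] centroids) {
    if (keys.size() != vectors.size()) {
      throw new IllegalArgumentException("keys and vectors differ in length");
    }
    List<Assignment> out = new ArrayList<>(keys.size());
    for (int i = 0; i < keys.size(); i++) {
      float[] v = vectors.get(i);
      out.add(new Assignment(keys.get(i), v, nearestCentroid(v, centroids)));
    }
    return out;
  }

  public static void main(String[] args) {
    float[][] centroids = { {0f, 0f}, {10f, 10f} };
    List<String> keys = new ArrayList<>();
    List<float[]> vectors = new ArrayList<>();
    keys.add("k1");
    vectors.add(new float[] {1f, 1f});
    keys.add("k2");
    vectors.add(new float[] {9f, 8f});
    for (Assignment a : assign(keys, vectors, centroids)) {
      System.out.println(a.recordKey + " -> cluster " + a.clusterId);
    }
  }
}
```

A unit test along the lines of the one described under Tests could feed fixed centroids into such a helper and check the resulting cluster distribution, without needing a trained model.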
