rahil-c opened a new issue, #18856: URL: https://github.com/apache/hudi/issues/18856
Part of #18676. RFC-104 / [design PR](https://github.com/chrevanthreddy/hudi/pull/1).

## Scope

Prove the milestone-1 pipeline works end-to-end on Spark.

## Tasks

- New Scala test `hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestVectorIndexBootstrap.scala`.
- Test flow:
  1. Write a small Hudi MOR table (~1k rows) with a `vector` column populated by synthetic embeddings drawn from K well-separated Gaussian clusters in R^32.
  2. Run `CREATE INDEX vec_idx ON tbl USING vector_index (vector) OPTIONS (numClusters = 'K', fgPerCluster = '2')`.
  3. Assertions:
     - MDT partition `vector_index_vec_idx` exists on disk.
     - MDT file-group count equals `K * fgPerCluster`.
     - Every base-table record key appears exactly once in the MDT partition.
     - Each MDT record's `clusterId` is in `[0, K)` and its `vector` field matches the base-table vector for that key.
     - Bonus assertion: KMeans recovered the synthetic clusters (centroid-to-truth nearest-neighbor distance below a threshold).

## Depends on

- Sub-issues 1–6 (this is the integration test that lights up the whole milestone)
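To make the synthetic-data step and the bonus assertion concrete, here is a minimal sketch in plain Scala. It uses no Hudi or Spark API. All names (`SyntheticClusters`, `trueCenters`, `sample`, `labelMeans`, `maxNearestCentroidDistance`) are hypothetical, and the values K = 4, separation 10, sigma 0.5, and seed 42 are assumptions chosen only for illustration. The per-label means are a stand-in for the centroids the index would produce; how the real test reads centroids back from the MDT is not specified in this issue.

```scala
import scala.util.Random

object SyntheticClusters {
  type Vec = Array[Float]

  // Hypothetical helper: place the k true centers on scaled coordinate axes,
  // so any two centers are separation * sqrt(2) apart (requires k <= dim).
  def trueCenters(k: Int, dim: Int, separation: Float): Array[Vec] = {
    require(k <= dim, "this layout needs one axis per cluster")
    Array.tabulate(k) { c =>
      val v = new Array[Float](dim)
      v(c) = separation
      v
    }
  }

  // Draw n points with isotropic Gaussian noise; row i belongs to cluster i % k.
  def sample(centers: Array[Vec], n: Int, sigma: Double, seed: Long): Array[(Int, Vec)] = {
    val rng = new Random(seed)
    Array.tabulate(n) { i =>
      val c = i % centers.length
      (c, centers(c).map(x => (x + sigma * rng.nextGaussian()).toFloat))
    }
  }

  // Euclidean distance between two vectors of equal length.
  def distance(a: Vec, b: Vec): Double =
    math.sqrt(a.zip(b).map { case (x, y) => val d = (x - y).toDouble; d * d }.sum)

  // Bonus-assertion metric: the worst distance from any true center
  // to its nearest recovered centroid.
  def maxNearestCentroidDistance(truth: Array[Vec], recovered: Array[Vec]): Double =
    truth.map(t => recovered.map(r => distance(t, r)).min).max

  // Stand-in for KMeans output (assumption): the mean of each labelled group.
  def labelMeans(points: Array[(Int, Vec)], k: Int, dim: Int): Array[Vec] =
    Array.tabulate(k) { c =>
      val members = points.collect { case (label, v) if label == c => v }
      Array.tabulate(dim)(j => (members.map(m => m(j).toDouble).sum / members.length).toFloat)
    }

  // Illustrative parameters (assumptions, not taken from the issue except Dim and ~1k rows).
  val K = 4
  val Dim = 32
  val N = 1000
  val FgPerCluster = 2

  val centers: Array[Vec] = trueCenters(K, Dim, 10.0f)
  val rows: Array[(Int, Vec)] = sample(centers, N, 0.5, seed = 42L)
  val recovered: Array[Vec] = labelMeans(rows, K, Dim)

  // The file-group count the second assertion would compare against.
  val expectedFileGroups: Int = K * FgPerCluster
}
```

In the real test, `rows` would be written to the MOR table and `recovered` would come from the index rather than from `labelMeans`. The threshold for `maxNearestCentroidDistance` would need to be set relative to the chosen separation and sigma; with the values above, the centers are about 14 apart, so a threshold around 1.0 leaves a wide margin.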
