rahil-c opened a new issue, #18856:
URL: https://github.com/apache/hudi/issues/18856

   Part of #18676. RFC-104 / [design PR](https://github.com/chrevanthreddy/hudi/pull/1).
   
   ## Scope
   
   Prove the milestone-1 pipeline works end-to-end on Spark.
   
   ## Tasks
   
   - Add a new Scala test: 
`hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestVectorIndexBootstrap.scala`.
   - Test flow:
     1. Write a small Hudi MOR table (~1k rows) with a `vector` column 
populated by synthetic embeddings drawn from K well-separated Gaussian clusters 
in R^32.
     2. Run `CREATE INDEX vec_idx ON tbl USING vector_index (vector) OPTIONS (numClusters = 'K', fgPerCluster = '2')`.
     3. Assertions:
        - The metadata table (MDT) partition `vector_index_vec_idx` exists on disk.
        - MDT file-group count equals `K * fgPerCluster`.
        - Every base-table record key appears exactly once in the MDT partition.
        - Each MDT record's `clusterId` is in `[0, K)` and its `vector` field 
matches the base-table vector for that key.
        - Bonus assertion: KMeans recovered the synthetic clusters 
(centroid-to-truth nearest-neighbor distance below a threshold).
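
   A minimal sketch of the synthetic-data step and the bonus centroid check, in plain Scala with no Spark or Hudi dependency. Everything here is an assumption made for illustration: the object and method names, the choice to place each true centroid on its own axis, and the values K = 4, sigma = 0.1 and threshold 0.5. Per-cluster sample means stand in for the centroids the index build would learn; the real test would read those from the index instead.

```scala
import scala.util.Random

// Hypothetical helper object; none of these names come from Hudi or RFC-104.
object SyntheticClusters {
  type Vec = Array[Float]

  // One true centroid per cluster, each on its own axis, so any two
  // centroids are sqrt(2) * separation apart.
  def trueCentroids(k: Int, dim: Int, separation: Float): Array[Vec] = {
    require(k <= dim, "this layout needs one axis per cluster")
    Array.tabulate(k) { c =>
      Array.tabulate(dim)(d => if (d == c) separation else 0.0f)
    }
  }

  // Draw n labelled points; point i belongs to cluster i % k and is the
  // cluster's centroid plus isotropic Gaussian noise of scale sigma.
  def sample(centroids: Array[Vec], n: Int, sigma: Float, seed: Long): Seq[(Int, Vec)] = {
    val rng = new Random(seed)
    (0 until n).map { i =>
      val c = i % centroids.length
      val v = centroids(c).map(x => x + sigma * rng.nextGaussian().toFloat)
      (c, v)
    }
  }

  // Stand-in for learned centroids: the mean of each labelled cluster.
  def clusterMeans(points: Seq[(Int, Vec)], k: Int, dim: Int): Array[Vec] =
    Array.tabulate(k) { c =>
      val members = points.filter(_._1 == c).map(_._2)
      Array.tabulate(dim)(d => (members.map(_(d).toDouble).sum / members.size).toFloat)
    }

  // Euclidean distance between two vectors of equal length.
  def l2(a: Vec, b: Vec): Double =
    math.sqrt(a.indices.map { i =>
      val diff = (a(i) - b(i)).toDouble
      diff * diff
    }.sum)

  // For every true centroid, find the nearest learned centroid and
  // return the worst (largest) of those distances.
  def maxNearestDistance(truth: Array[Vec], learned: Array[Vec]): Double =
    truth.map(t => learned.map(l => l2(t, l)).min).max

  def main(args: Array[String]): Unit = {
    val (k, dim) = (4, 32)
    val truth = trueCentroids(k, dim, separation = 10.0f)
    val points = sample(truth, n = 1000, sigma = 0.1f, seed = 42L)
    val learned = clusterMeans(points, k, dim)
    // Bonus-style assertion: every true centroid has a learned one close by.
    assert(maxNearestDistance(truth, learned) < 0.5)
  }
}
```

   A fixed seed keeps the test deterministic. The one-sided metric (truth to nearest learned centroid) does not depend on the order in which KMeans numbers its clusters, so no label matching is needed.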
   
   ## Depends on
   
   - Sub-issues 1–6 (this is the integration test that lights up the whole 
milestone)
   

