[I] [Vector Index] KMeans training stage using Spark MLlib [hudi]

via GitHub Tue, 26 May 2026 11:11:56 -0700


rahil-c opened a new issue, #18853:
URL: https://github.com/apache/hudi/issues/18853


   Part of #18676. RFC-104 / [design 
PR](https://github.com/chrevanthreddy/hudi/pull/1).
   
   ## Scope
   
   The IVF (inverted file) training step: produce K cluster centroids over the 
vector column of the data table. No metadata table (MDT) writes in this PR; 
this is a pure training utility.
   
   ## Tasks
   
   - New class `org.apache.hudi.index.vector.SparkVectorIndexTrainer` in 
`hudi-client/hudi-spark-client`.
   - Inputs: dataset path (or `Dataset<Row>`), vector column name, 
`numClusters`, `trainingSampleSize`.
   - Implementation:
     - Reads the data table via Spark (parquet base files for now).
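     - Samples up to `trainingSampleSize` vectors with 
`Dataset.sample(fraction, seed)`, using a fixed seed for reproducibility. 
`sample` takes a fraction rather than a row count, so the fraction has to be 
derived from `trainingSampleSize` and the table's row count.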
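     - Converts the `array<float>` vector column to 
`org.apache.spark.ml.linalg.Vector` via 
`org.apache.spark.ml.functions.array_to_vector` (Spark 3.1+) or a UDF. 
`VectorAssembler` does not accept array-typed input columns.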
     - Fits `org.apache.spark.ml.clustering.KMeans` with configured `k`.
     - Returns `Dataset<Centroid>` (`clusterId: int, centroid: array<double>`).
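
The sampling step needs a fraction for `Dataset.sample(fraction, seed)`, while the trainer's input is a row budget. A minimal sketch of that conversion in plain Java follows. The class and method names (`TrainingSample`, `fractionFor`) are hypothetical and not part of Hudi or Spark, and it assumes the table's row count is already known (for example from a prior `count()`).

```java
// Hypothetical helper for illustration; not an existing Hudi or Spark class.
final class TrainingSample {
    private TrainingSample() {}

    /**
     * Fraction to pass to Dataset.sample(fraction, seed) so that the expected
     * number of sampled rows is roughly trainingSampleSize.
     */
    static double fractionFor(long totalRows, long trainingSampleSize) {
        if (totalRows < 0 || trainingSampleSize <= 0) {
            throw new IllegalArgumentException(
                "totalRows must be >= 0 and trainingSampleSize must be > 0");
        }
        if (totalRows == 0 || trainingSampleSize >= totalRows) {
            return 1.0; // the whole table fits within the budget: use every row
        }
        return (double) trainingSampleSize / (double) totalRows;
    }
}
```

Because `sample` draws rows independently, the sampled row count only approximates the budget. If "up to `trainingSampleSize`" is meant as a hard cap, one option is to follow the sample with `Dataset.limit(...)`.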
   
   ## Tests
   
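   - Unit test on synthetic dataset (e.g. 3 well-separated Gaussian blobs in 
R^16): assert `numClusters` centroids returned and KMeans inertia (the 
within-cluster sum of squared distances, exposed in Spark as 
`KMeansModel.summary().trainingCost()`) is materially lower than a 
random-centroid baseline.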
   - Determinism test: same seed → same centroids.
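
The shape of these two tests can be sketched without a Spark dependency. The block below is a plain-Java stand-in, not the proposed trainer: it swaps in a basic Lloyd's algorithm for Spark MLlib's `KMeans`, and every name in it (`KMeansBlobSketch`, `blobs`, `fit`, `inertia`, `randomCentroids`) is hypothetical. The blob centers (0, 10 and 20 on every axis) and unit variance are illustrative choices, and the random-centroid baseline is taken to be the seeded initial centroids.

```java
import java.util.Random;

// Plain-Java illustration only: Lloyd's algorithm stands in for Spark MLlib KMeans.
final class KMeansBlobSketch {
    static final int DIM = 16;

    /** nPerBlob points around each of three centers (0, 10, 20 on every axis). */
    static double[][] blobs(int nPerBlob, long seed) {
        Random rnd = new Random(seed);
        double[][] pts = new double[3 * nPerBlob][DIM];
        for (int i = 0; i < pts.length; i++) {
            double center = 10.0 * (i / nPerBlob); // integer division: blob 0, 1 or 2
            for (int d = 0; d < DIM; d++) pts[i][d] = center + rnd.nextGaussian();
        }
        return pts;
    }

    static double sqDist(double[] a, double[] b) {
        double s = 0;
        for (int d = 0; d < a.length; d++) { double t = a[d] - b[d]; s += t * t; }
        return s;
    }

    static int nearest(double[] p, double[][] centroids) {
        int best = 0;
        for (int c = 1; c < centroids.length; c++)
            if (sqDist(p, centroids[c]) < sqDist(p, centroids[best])) best = c;
        return best;
    }

    /** Sum of squared distances from each point to its nearest centroid. */
    static double inertia(double[][] pts, double[][] centroids) {
        double total = 0;
        for (double[] p : pts) total += sqDist(p, centroids[nearest(p, centroids)]);
        return total;
    }

    /** k data points picked with a seeded RNG; used as initialization and as baseline. */
    static double[][] randomCentroids(double[][] pts, int k, long seed) {
        Random rnd = new Random(seed);
        double[][] c = new double[k][];
        for (int i = 0; i < k; i++) c[i] = pts[rnd.nextInt(pts.length)].clone();
        return c;
    }

    /** Lloyd's algorithm: alternate assignment and mean update for a fixed iteration count. */
    static double[][] fit(double[][] pts, int k, long seed, int iterations) {
        double[][] centroids = randomCentroids(pts, k, seed);
        for (int it = 0; it < iterations; it++) {
            double[][] sums = new double[k][DIM];
            int[] counts = new int[k];
            for (double[] p : pts) {
                int c = nearest(p, centroids);
                counts[c]++;
                for (int d = 0; d < DIM; d++) sums[c][d] += p[d];
            }
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) continue; // empty cluster: keep the previous centroid
                for (int d = 0; d < DIM; d++) centroids[c][d] = sums[c][d] / counts[c];
            }
        }
        return centroids;
    }
}
```

A test built on this would call `fit` twice with the same seed and compare the results with `Arrays.deepEquals`, then check that `inertia` of the fitted centroids is well below `inertia` of `randomCentroids` for the same seed. With the real trainer, the same assertions would presumably run against the returned `Dataset<Centroid>` and Spark's reported training cost.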
   
   ## Depends on
   
   - None (pure utility, doesn't touch MDT)
   
   ## Out of scope
   
   Bootstrap orchestration, MDT writing, centroid persistence (deferred to a 
later sub-task).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
