rahil-c opened a new issue, #18853: URL: https://github.com/apache/hudi/issues/18853
Part of #18676. RFC-104 / [design PR](https://github.com/chrevanthreddy/hudi/pull/1).

## Scope

The IVF training step: produce K cluster centroids over the vector column of the data table. No MDT writes in this PR; it is a pure training utility.

## Tasks

- New class `org.apache.hudi.index.vector.SparkVectorIndexTrainer` in `hudi-client/hudi-spark-client`.
- Inputs: dataset path (or `Dataset<Row>`), vector column name, `numClusters`, `trainingSampleSize`.
- Implementation:
  - Reads the data table via Spark (parquet base files for now).
  - Samples up to `trainingSampleSize` vectors (`Dataset.sample(...)` with deterministic seed for reproducibility).
  - Converts the `array<float>` vector column to `org.apache.spark.ml.linalg.Vector` via `VectorAssembler` / UDF.
  - Fits `org.apache.spark.ml.clustering.KMeans` with configured `k`.
  - Returns `Dataset<Centroid>` (`clusterId: int, centroid: array<double>`).

## Tests

- Unit test on synthetic dataset (e.g. 3 well-separated Gaussian blobs in R^16): assert `numClusters` centroids returned and KMeans inertia is materially lower than a random-centroid baseline.
- Determinism test: same seed → same centroids.

## Depends on

- None (pure utility, doesn't touch MDT)

## Out of scope

Bootstrap orchestration, MDT writing, centroid persistence (deferred to a later sub-task).

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at: [email protected]
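
The checks listed under Tests (centroid count, inertia against a random-centroid baseline, seed determinism) could be prototyped without Spark roughly as follows. This is only a sketch under stated assumptions: the class `CentroidTrainingSketch` and all of its methods are hypothetical names, and a plain-Java Lloyd's iteration with farthest-point seeding stands in for `org.apache.spark.ml.clustering.KMeans`. It illustrates the assertions, not the trainer itself. Centroids are returned as `double[][]`, with the row index playing the role of `clusterId`.

```java
import java.util.Arrays;
import java.util.Random;

/** Hypothetical, Spark-free stand-in used only to illustrate the test assertions. */
final class CentroidTrainingSketch {

  /** Generates perBlob points around each center with Gaussian noise of std dev sigma. */
  static double[][] blobs(double[][] centers, int perBlob, double sigma, long seed) {
    Random rnd = new Random(seed);
    int dim = centers[0].length;
    double[][] out = new double[centers.length * perBlob][dim];
    for (int i = 0; i < out.length; i++) {
      double[] c = centers[i / perBlob];
      for (int d = 0; d < dim; d++) out[i][d] = c[d] + sigma * rnd.nextGaussian();
    }
    return out;
  }

  static double sqDist(double[] a, double[] b) {
    double s = 0;
    for (int d = 0; d < a.length; d++) { double t = a[d] - b[d]; s += t * t; }
    return s;
  }

  /** Index of the centroid closest to p. */
  static int nearest(double[] p, double[][] centroids) {
    int best = 0;
    for (int c = 1; c < centroids.length; c++)
      if (sqDist(p, centroids[c]) < sqDist(p, centroids[best])) best = c;
    return best;
  }

  /** Inertia: sum of squared distances from each point to its nearest centroid. */
  static double inertia(double[][] data, double[][] centroids) {
    double total = 0;
    for (double[] p : data) total += sqDist(p, centroids[nearest(p, centroids)]);
    return total;
  }

  /** Untrained baseline: k centroids drawn uniformly from the data's bounding box. */
  static double[][] randomCentroids(double[][] data, int k, long seed) {
    Random rnd = new Random(seed);
    int dim = data[0].length;
    double[] lo = data[0].clone(), hi = data[0].clone();
    for (double[] p : data)
      for (int d = 0; d < dim; d++) { lo[d] = Math.min(lo[d], p[d]); hi[d] = Math.max(hi[d], p[d]); }
    double[][] out = new double[k][dim];
    for (double[] c : out)
      for (int d = 0; d < dim; d++) c[d] = lo[d] + rnd.nextDouble() * (hi[d] - lo[d]);
    return out;
  }

  /** Lloyd's algorithm; assumption: farthest-point seeding instead of Spark's initialization. */
  static double[][] fit(double[][] data, int k, int iterations, long seed) {
    int dim = data[0].length;
    double[][] centroids = new double[k][];
    // First centroid is a seeded random point; each later one is the point
    // farthest from the centroids chosen so far.
    centroids[0] = data[new Random(seed).nextInt(data.length)].clone();
    for (int c = 1; c < k; c++) {
      double[][] chosen = Arrays.copyOf(centroids, c);
      double[] far = data[0];
      for (double[] p : data)
        if (sqDist(p, chosen[nearest(p, chosen)]) > sqDist(far, chosen[nearest(far, chosen)])) far = p;
      centroids[c] = far.clone();
    }
    for (int it = 0; it < iterations; it++) {
      double[][] sums = new double[k][dim];
      int[] counts = new int[k];
      for (double[] p : data) {
        int c = nearest(p, centroids);
        counts[c]++;
        for (int d = 0; d < dim; d++) sums[c][d] += p[d];
      }
      for (int c = 0; c < k; c++) {
        if (counts[c] == 0) continue; // keep an empty cluster's centroid unchanged
        for (int d = 0; d < dim; d++) centroids[c][d] = sums[c][d] / counts[c];
      }
    }
    return centroids;
  }

  public static void main(String[] args) {
    // Three well-separated blobs in R^16, centered at 0, +10 and -10 on every axis.
    double[][] centers = new double[3][16];
    Arrays.fill(centers[1], 10.0);
    Arrays.fill(centers[2], -10.0);
    double[][] data = blobs(centers, 100, 0.5, 42L);
    double[][] trained = fit(data, 3, 20, 7L);
    System.out.println("centroids: " + trained.length);
    System.out.println("trained inertia:  " + inertia(data, trained));
    System.out.println("baseline inertia: " + inertia(data, randomCentroids(data, 3, 7L)));
    System.out.println("deterministic: " + Arrays.deepEquals(trained, fit(data, 3, 20, 7L)));
  }
}
```

In the real test the `fit` call would presumably be replaced by `SparkVectorIndexTrainer`, with inertia computed over the returned centroids in the same way. The factor that counts as "materially lower" is left to the test author.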
