rahil-c opened a new issue, #18855: URL: https://github.com/apache/hudi/issues/18855
Part of #18676. RFC-104 / [design PR](https://github.com/chrevanthreddy/hudi/pull/1).

## Scope

User-facing entry point that triggers the bootstrap pipeline from sub-issue 5.

## Tasks

- Extend `hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/IndexCommands.scala` to recognize the `vector_index` index type.
- Extend `hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/index/HoodieSparkIndexClient.java` to:
  - Accept SQL options: `vectorColumn` (required), `numClusters` (optional, default from config), `fgPerCluster` (optional, default from config).
  - Validate the column exists on the table and is of type `array<float>` or `array<double>`.
  - Persist user-supplied params into `HoodieIndexDefinition` (so the bootstrap can read them back).
  - Invoke `ScheduleIndexActionExecutor` → metadata writer bootstrap path.

Example DDL the change must support:

```sql
CREATE INDEX my_vec_idx ON hudi_tbl
USING vector_index (embedding)
OPTIONS (numClusters = '128', fgPerCluster = '2');
```

## Tests

- Negative test: missing `vectorColumn` option → clear error.
- Negative test: non-array column → clear error.
- Positive test: valid DDL parses and persists the `HoodieIndexDefinition` correctly.

## Depends on

- Sub-issues 1, 5 (need partition type + bootstrap implementation)

## Out of scope

`DROP INDEX` and `REFRESH INDEX` for vector indexes (later milestone).
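
For discussion, the option handling and column check listed under Tasks could look roughly like the sketch below. It is not Hudi code: `VectorIndexOptionsSketch`, `Params`, and `validate` are hypothetical names, the table schema is modeled as a plain map from column name to type string rather than a Spark or Avro schema, and it assumes the vector column reaches the client under the key `vectorColumn`. Requiring `numClusters` and `fgPerCluster` to be positive integers is also an assumption; the issue only says they default from config.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper for illustration only; not part of Hudi. */
final class VectorIndexOptionsSketch {

  // Column types the issue says are acceptable for a vector column.
  private static final Set<String> SUPPORTED_TYPES = Set.of("array<float>", "array<double>");

  /** Validated parameters. A null Integer means "use the default from config". */
  static final class Params {
    final String vectorColumn;
    final Integer numClusters;
    final Integer fgPerCluster;

    Params(String vectorColumn, Integer numClusters, Integer fgPerCluster) {
      this.vectorColumn = vectorColumn;
      this.numClusters = numClusters;
      this.fgPerCluster = fgPerCluster;
    }
  }

  /**
   * @param options     user-supplied SQL options (assumed key names from the issue)
   * @param tableSchema column name to type string, a stand-in for the real table schema
   */
  static Params validate(Map<String, String> options, Map<String, String> tableSchema) {
    String column = options.get("vectorColumn");
    if (column == null || column.trim().isEmpty()) {
      throw new IllegalArgumentException(
          "vector_index requires the 'vectorColumn' option");
    }
    String type = tableSchema.get(column);
    if (type == null) {
      throw new IllegalArgumentException(
          "Column '" + column + "' does not exist on the table");
    }
    // Normalize case and spacing so "ARRAY<FLOAT>" and "array< float >" both match.
    String normalized = type.toLowerCase(Locale.ROOT).replace(" ", "");
    if (!SUPPORTED_TYPES.contains(normalized)) {
      throw new IllegalArgumentException(
          "Column '" + column + "' has type " + type
              + "; expected array<float> or array<double>");
    }
    return new Params(
        column,
        parsePositiveInt(options, "numClusters"),
        parsePositiveInt(options, "fgPerCluster"));
  }

  // Returns null when the option is absent so the caller can apply the config default.
  private static Integer parsePositiveInt(Map<String, String> options, String key) {
    String raw = options.get(key);
    if (raw == null) {
      return null;
    }
    int value;
    try {
      value = Integer.parseInt(raw.trim());
    } catch (NumberFormatException e) {
      throw new IllegalArgumentException(
          "Option '" + key + "' must be an integer, got: " + raw, e);
    }
    if (value <= 0) {
      throw new IllegalArgumentException(
          "Option '" + key + "' must be positive, got: " + raw);
    }
    return value;
  }
}
```

The two negative tests in the issue map onto the first and third `IllegalArgumentException` paths. The positive test would then check that the returned values are the ones persisted into `HoodieIndexDefinition`.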
