rahil-c opened a new issue, #18855:
URL: https://github.com/apache/hudi/issues/18855

   Part of #18676. RFC-104 / [design PR](https://github.com/chrevanthreddy/hudi/pull/1).
   
   ## Scope
   
   User-facing entry point that triggers the bootstrap pipeline from sub-issue 5.
   
   ## Tasks
   
   - Extend `hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/IndexCommands.scala` to recognize the `vector_index` index type.
   - Extend `hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/index/HoodieSparkIndexClient.java` to:
     - Accept SQL options: `vectorColumn` (required), `numClusters` (optional, default from config), `fgPerCluster` (optional, default from config).
     - Validate that the column exists on the table and is of type `array<float>` or `array<double>`.
     - Persist user-supplied params into `HoodieIndexDefinition` (so the bootstrap can read them back).
     - Invoke `ScheduleIndexActionExecutor` → metadata writer bootstrap path.
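
The option handling and column validation listed above could look roughly like the following. This is a minimal sketch, not the Hudi implementation: `VectorIndexOptionsSketch` and `resolve` are hypothetical names, the table schema is modeled as a plain map from column name to a Spark-style type string instead of a real schema object, and the requirement that `numClusters` and `fgPerCluster` be positive integers is an assumption the issue does not state. It also assumes the indexed column reaches the helper under the `vectorColumn` key, however the DDL surfaces it.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper for illustration only; not part of Hudi. */
final class VectorIndexOptionsSketch {
  // Option keys as named in the issue text.
  static final String VECTOR_COLUMN = "vectorColumn";
  static final String NUM_CLUSTERS = "numClusters";
  static final String FG_PER_CLUSTER = "fgPerCluster";

  private VectorIndexOptionsSketch() {}

  /**
   * Validates user options and fills in defaults for the optional ones.
   *
   * @param options  user-supplied options
   * @param schema   column name to type string, e.g. "array<float>" (simplified model)
   * @param defaults config defaults for the optional keys
   * @return the resolved options that would be persisted with the index definition
   */
  static Map<String, String> resolve(Map<String, String> options,
                                     Map<String, String> schema,
                                     Map<String, String> defaults) {
    String column = options.get(VECTOR_COLUMN);
    if (column == null || column.trim().isEmpty()) {
      throw new IllegalArgumentException("Missing required option '" + VECTOR_COLUMN + "'");
    }
    String type = schema.get(column);
    if (type == null) {
      throw new IllegalArgumentException("Column '" + column + "' does not exist on the table");
    }
    // Normalize whitespace and case so "ARRAY< float >" is treated like "array<float>".
    String normalized = type.replaceAll("\\s+", "").toLowerCase(Locale.ROOT);
    if (!normalized.equals("array<float>") && !normalized.equals("array<double>")) {
      throw new IllegalArgumentException("Column '" + column + "' has type '" + type
          + "'; expected array<float> or array<double>");
    }

    Map<String, String> resolved = new HashMap<>();
    resolved.put(VECTOR_COLUMN, column);
    resolved.put(NUM_CLUSTERS, positiveInt(NUM_CLUSTERS,
        options.getOrDefault(NUM_CLUSTERS, defaults.get(NUM_CLUSTERS))));
    resolved.put(FG_PER_CLUSTER, positiveInt(FG_PER_CLUSTER,
        options.getOrDefault(FG_PER_CLUSTER, defaults.get(FG_PER_CLUSTER))));
    return resolved;
  }

  // Assumption: both numeric options must be positive integers.
  private static String positiveInt(String key, String raw) {
    if (raw == null) {
      throw new IllegalArgumentException("No value or default available for '" + key + "'");
    }
    try {
      int value = Integer.parseInt(raw.trim());
      if (value <= 0) {
        throw new NumberFormatException("not positive");
      }
      return Integer.toString(value);
    } catch (NumberFormatException e) {
      throw new IllegalArgumentException("Option '" + key + "' must be a positive integer, got '" + raw + "'");
    }
  }
}
```

Calling `resolve` with `vectorColumn = embedding` and `numClusters = 128` against a schema where `embedding` is `array<float>` returns a map in which `fgPerCluster` is taken from the supplied defaults. The two negative tests in this issue correspond to the first and third `IllegalArgumentException` branches.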
   
   Example DDL the change must support:
   
   ```sql
   CREATE INDEX my_vec_idx ON hudi_tbl
   USING vector_index (embedding)
   OPTIONS (numClusters = '128', fgPerCluster = '2');
   ```
   
   ## Tests
   
   - Negative test: missing `vectorColumn` option → clear error.
   - Negative test: non-array column → clear error.
   - Positive test: valid DDL parses and persists the `HoodieIndexDefinition` correctly.
   
   ## Depends on
   
   - Sub-issues 1, 5 (need partition type + bootstrap implementation)
   
   ## Out of scope
   
   `DROP INDEX` and `REFRESH INDEX` for vector indexes (later milestone).
   

