vinothchandar commented on code in PR #14218: URL: https://github.com/apache/hudi/pull/14218#discussion_r2898272313
########## rfc/rfc-102/rfc-102.md: ########## @@ -0,0 +1,220 @@ +<!-- +Licensed to the Apache Software Foundation (ASF) under one or more +contributor license agreements. See the NOTICE file distributed with +this work for additional information regarding copyright ownership. +The ASF licenses this file to You under the Apache License, Version 2.0 +(the "License"); you may not use this file except in compliance with +the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. +--> + +# RFC-102: Batch Vector Search in Apache Hudi + +## Proposers + +- @rahil-c + +## Approvers + +- @vinothchandar +- @balaji-varadarajan-ai +- @yihua + +## Abstract + +This RFC proposes adding the ability to perform a native **vector similarity search** on Hudi tables via Spark SQL. +For the initial phase, we want to allow users to perform an offline batch(Spark) vector search that will run a distributed k-nearest neighbors(knn) algorithm. +Note this RFC will not cover **future enhancements** around a vector index to speed up this search. These details will be discussed in a future RFC around index creation and management: https://github.com/apache/hudi/pull/14255 + +### What is a Vector Embedding? + +A **vector embedding** is a fixed-dimensional numeric representation (typically an array of `FLOAT`/`DOUBLE`/`INT`) produced by a model to capture semantic properties of an object (text, image, audio, etc.). For example: + +- An image encoder maps each image to a `d`-dimensional vector capturing visual semantics. +- A text encoder maps a sentence or document to a `d`-dimensional vector capturing its meaning. + +This RFC assumes that embeddings are already generated by an encoder/model, and are already stored as numeric vectors in a Hudi table using the VECTOR logical type https://github.com/apache/hudi/pull/18184/changes, +and focuses on how to **search** them given a new embedding. + +## Generally does this KNN vector search work for? +* Given a **query embedding** (i.e, raw data like an image, text, video that has now been transformed by an embedding model ), +* Find the **K** most similar rows in a Hudi table, +* Return those the rows which have the **smallest similarity score**. +* A similarity score is calculated by using a **distance metric** such as running one of the following **(cosine, dot product, L2)** + + +#### Visual example using a simple table +Consider a table that contains an **image/blob column** and a **vector embedding column**. For simplicity, we'll use 3-dimensional embeddings (real systems typically use dimensions much larger like > 1000 [https://developers.openai.com/api/docs/guides/embeddings/](https://developers.openai.com/api/docs/guides/embeddings/)): + +<img src="https://raw.githubusercontent.com/rahil-c/hudi/rahil/rfc102-hudi-vector-search/rfc/rfc-102/embedding_table.png" width="600" /> + +Now suppose the user has a new image of a **cat (`cat3.jpg`)** and uses an external model to generate its embedding: +**- Query embedding for** ``cat3.jpg`: `q = [0.09, 0.19, 0.31]`` + +<img src="https://raw.githubusercontent.com/rahil-c/hudi/rahil/rfc102-hudi-vector-search/rfc/rfc-102/cat_emebdding.png" width="600" /> + +A typical query flow for this example would be: +1. User calls a vector search function on the table and passes **`q`.** +2. The function **reads all rows (or a filtered subset)** from the table. +3. For each row, it extracts `image_embedding` and computes the distance to ``q` (for now we will use, **L2 Euclidean distance** to keep things simple which runs the below for each row ): + +In three dimensions, at each index of the vector, the distance is computed by running the following + +- `dist(q, cat1.jpg) ≈ 0.02` +- `dist(q, cat2.jpg) ≈ 0.03` +- `dist(q, dog1.jpg) ≈ 1.21` + +<img src="https://raw.githubusercontent.com/rahil-c/hudi/rahil/rfc102-hudi-vector-search/rfc/rfc-102/comparison_embedding.png" width="600" /> + +4. It keeps the **K rows with the lowest distance (highest similarity)**. For example, for `k = 2` the results might be: + + +### ## What is the typical user experience for initiating this search? Review Comment: lets be clear about what has to be in Appendix.. I think. research like this has to be in Appendix. anything reader needs to understand to read the design and implementation needs to be in background. ########## rfc/rfc-102/rfc-102.md: ########## @@ -0,0 +1,220 @@ +<!-- +Licensed to the Apache Software Foundation (ASF) under one or more +contributor license agreements. See the NOTICE file distributed with +this work for additional information regarding copyright ownership. +The ASF licenses this file to You under the Apache License, Version 2.0 +(the "License"); you may not use this file except in compliance with +the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. +--> + +# RFC-102: Batch Vector Search in Apache Hudi + +## Proposers + +- @rahil-c + +## Approvers + +- @vinothchandar +- @balaji-varadarajan-ai +- @yihua + +## Abstract + +This RFC proposes adding the ability to perform a native **vector similarity search** on Hudi tables via Spark SQL. +For the initial phase, we want to allow users to perform an offline batch(Spark) vector search that will run a distributed k-nearest neighbors(knn) algorithm. +Note this RFC will not cover **future enhancements** around a vector index to speed up this search. These details will be discussed in a future RFC around index creation and management: https://github.com/apache/hudi/pull/14255 + +### What is a Vector Embedding? + +A **vector embedding** is a fixed-dimensional numeric representation (typically an array of `FLOAT`/`DOUBLE`/`INT`) produced by a model to capture semantic properties of an object (text, image, audio, etc.). For example: + +- An image encoder maps each image to a `d`-dimensional vector capturing visual semantics. +- A text encoder maps a sentence or document to a `d`-dimensional vector capturing its meaning. + +This RFC assumes that embeddings are already generated by an encoder/model, and are already stored as numeric vectors in a Hudi table using the VECTOR logical type https://github.com/apache/hudi/pull/18184/changes, +and focuses on how to **search** them given a new embedding. + +## Generally does this KNN vector search work for? +* Given a **query embedding** (i.e, raw data like an image, text, video that has now been transformed by an embedding model ), +* Find the **K** most similar rows in a Hudi table, +* Return those the rows which have the **smallest similarity score**. +* A similarity score is calculated by using a **distance metric** such as running one of the following **(cosine, dot product, L2)** + + +#### Visual example using a simple table +Consider a table that contains an **image/blob column** and a **vector embedding column**. For simplicity, we'll use 3-dimensional embeddings (real systems typically use dimensions much larger like > 1000 [https://developers.openai.com/api/docs/guides/embeddings/](https://developers.openai.com/api/docs/guides/embeddings/)): + +<img src="https://raw.githubusercontent.com/rahil-c/hudi/rahil/rfc102-hudi-vector-search/rfc/rfc-102/embedding_table.png" width="600" /> + +Now suppose the user has a new image of a **cat (`cat3.jpg`)** and uses an external model to generate its embedding: +**- Query embedding for** ``cat3.jpg`: `q = [0.09, 0.19, 0.31]`` + +<img src="https://raw.githubusercontent.com/rahil-c/hudi/rahil/rfc102-hudi-vector-search/rfc/rfc-102/cat_emebdding.png" width="600" /> + +A typical query flow for this example would be: +1. User calls a vector search function on the table and passes **`q`.** +2. The function **reads all rows (or a filtered subset)** from the table. +3. For each row, it extracts `image_embedding` and computes the distance to ``q` (for now we will use, **L2 Euclidean distance** to keep things simple which runs the below for each row ): + +In three dimensions, at each index of the vector, the distance is computed by running the following + +- `dist(q, cat1.jpg) ≈ 0.02` +- `dist(q, cat2.jpg) ≈ 0.03` +- `dist(q, dog1.jpg) ≈ 1.21` + +<img src="https://raw.githubusercontent.com/rahil-c/hudi/rahil/rfc102-hudi-vector-search/rfc/rfc-102/comparison_embedding.png" width="600" /> + +4. It keeps the **K rows with the lowest distance (highest similarity)**. For example, for `k = 2` the results might be: + + +### ## What is the typical user experience for initiating this search? + +We wanted to follow similar semantics as what other modern data systems offer for performing vector search. +See the following reference material here: +* https://docs.databricks.com/aws/en/sql/language-manual/functions/vector_search +* https://docs.snowflake.com/en/user-guide/snowflake-cortex/vector-embeddings +* https://docs.cloud.google.com/bigquery/docs/vector-search + + +## Single query embedding search +An initial start for how a hudi vector interface would look would be something like this. *Note* In the future we can an index parameter, once we have a vector index implementation for now though this has been omitted to avoid confusion. +``` +SELECT * +FROM hudi_vector_search( +table name or table_path => 'table' OR 's3://bucket/path/to/hudi/table', +embedding_col => 'image_embedding', +query_vector => ARRAY(0.12F, -0.03F, 0.81F, ...), +k => 10, +distance_metric => 'cosine' +) +); +``` +Users can then chain other SQL operations on top of this such as performing filters and join on the results. Here "products" is the table name and "embedding" is the vector embedding column. + +``` + // Vector search with WHERE clause filtering + val result = spark.sql( + s""" + |SELECT id, name, price, category, _distance + |FROM hudi_vector_search( + | 'products', + | 'embedding', + | ARRAY(1.0, 2.0, 3.0), + | 10 + |) + |WHERE category = 'electronics' AND price < 100 + |ORDER BY _distance + |""".stripMargin + ).collect() +``` + +``` + // Vector search with JOIN + val result = spark.sql( + s""" + |SELECT vs.id, vs.name, c.category_name, vs._distance + |FROM hudi_vector_search( + | 'products', + | 'embedding', + | ARRAY(1.5, 2.5), + | 3 + |) vs + |JOIN $categoriesTable c ON vs.category_id = c.category_id + |ORDER BY vs._distance + |""".stripMargin + ).collect() + +``` + +## Batch of query embeddings search +Say we have two tables that have their own vector embeddings columns(these columns share some semantic meaning with one another). The user should be able to compare the embeddings from tableA against the embeddings of tableB. +For each element in tableA, we can return back the nearest closest k elements from tableB. Review Comment: nit: write out what tables A and B are in example below? ########## rfc/rfc-102/rfc-102.md: ########## @@ -0,0 +1,220 @@ +<!-- +Licensed to the Apache Software Foundation (ASF) under one or more +contributor license agreements. See the NOTICE file distributed with +this work for additional information regarding copyright ownership. +The ASF licenses this file to You under the Apache License, Version 2.0 +(the "License"); you may not use this file except in compliance with +the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. +--> + +# RFC-102: Batch Vector Search in Apache Hudi + +## Proposers + +- @rahil-c + +## Approvers + +- @vinothchandar +- @balaji-varadarajan-ai +- @yihua + +## Abstract + +This RFC proposes adding the ability to perform a native **vector similarity search** on Hudi tables via Spark SQL. +For the initial phase, we want to allow users to perform an offline batch(Spark) vector search that will run a distributed k-nearest neighbors(knn) algorithm. +Note this RFC will not cover **future enhancements** around a vector index to speed up this search. These details will be discussed in a future RFC around index creation and management: https://github.com/apache/hudi/pull/14255 + +### What is a Vector Embedding? Review Comment: Can we call this `Background` .. basically stick to the RFC template? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
