Re: [PR] docs: RFC-102 - Spark Vector Search in Apache Hudi [hudi]

via GitHub Fri, 06 Mar 2026 15:03:21 -0800


vinothchandar commented on code in PR #14218:
URL: https://github.com/apache/hudi/pull/14218#discussion_r2898272313



##########
rfc/rfc-102/rfc-102.md:
##########
@@ -0,0 +1,220 @@
+<!--
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to You under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+# RFC-102: Batch Vector Search in Apache Hudi
+
+## Proposers
+
+- @rahil-c
+
+## Approvers
+
+- @vinothchandar
+- @balaji-varadarajan-ai
+- @yihua
+
+## Abstract
+
+This RFC proposes adding the ability to perform a native **vector similarity 
search** on Hudi tables via Spark SQL.
+For the initial phase, we want to allow users to perform an offline 
batch(Spark) vector search that will run a distributed k-nearest neighbors(knn) 
algorithm.
+Note this RFC will not cover **future enhancements** around a vector index to 
speed up this search. These details will be discussed in a future RFC around 
index creation and management: https://github.com/apache/hudi/pull/14255
+
+### What is a Vector Embedding?
+
+A **vector embedding** is a fixed-dimensional numeric representation 
(typically an array of `FLOAT`/`DOUBLE`/`INT`) produced by a model to capture 
semantic properties of an object (text, image, audio, etc.). For example:
+
+- An image encoder maps each image to a `d`-dimensional vector capturing 
visual semantics.
+- A text encoder maps a sentence or document to a `d`-dimensional vector 
capturing its meaning.
+
+This RFC assumes that embeddings are already generated by an encoder/model, 
and are already stored as numeric vectors in a Hudi table using the VECTOR 
logical type https://github.com/apache/hudi/pull/18184/changes, 
+and focuses on how to **search** them given a new embedding.
+
+## Generally does this KNN vector search work for?
+*   Given a **query embedding** (i.e, raw data like an image, text, video that 
has now been transformed by an embedding model ),
+*   Find the **K** most similar rows in a Hudi table,
+*   Return those the rows which have the **smallest similarity score**.
+*   A similarity score is calculated by using a **distance metric** such as 
running one of the following **(cosine, dot product, L2)**
+
+
+#### Visual example using a simple table
+Consider a table that contains an **image/blob column** and a **vector 
embedding column**. For simplicity, we'll use 3-dimensional embeddings (real 
systems typically use dimensions much larger like > 1000 
[https://developers.openai.com/api/docs/guides/embeddings/](https://developers.openai.com/api/docs/guides/embeddings/)):
+
+<img 
src="https://raw.githubusercontent.com/rahil-c/hudi/rahil/rfc102-hudi-vector-search/rfc/rfc-102/embedding_table.png";
 width="600" />
+
+Now suppose the user has a new image of a **cat (`cat3.jpg`)** and uses an 
external model to generate its embedding:
+**- Query embedding for** ``cat3.jpg`: `q = [0.09, 0.19, 0.31]``
+
+<img 
src="https://raw.githubusercontent.com/rahil-c/hudi/rahil/rfc102-hudi-vector-search/rfc/rfc-102/cat_emebdding.png";
 width="600" />
+
+A typical query flow for this example would be:
+1. User calls a vector search function on the table and passes **`q`.**
+2. The function **reads all rows (or a filtered subset)** from the table.
+3. For each row, it extracts `image_embedding` and computes the distance to 
``q` (for now we will use, **L2 Euclidean distance** to keep things simple 
which runs the below for each row ):
+
+In three dimensions, at each index of the vector, the distance is computed by 
running the following
+
+- `dist(q, cat1.jpg) ≈ 0.02`
+- `dist(q, cat2.jpg) ≈ 0.03`
+- `dist(q, dog1.jpg) ≈ 1.21`
+
+<img 
src="https://raw.githubusercontent.com/rahil-c/hudi/rahil/rfc102-hudi-vector-search/rfc/rfc-102/comparison_embedding.png";
 width="600" />
+
+4. It keeps the **K rows with the lowest distance (highest similarity)**. For 
example, for `k = 2` the results might be:
+
+
+### ## What is the typical user experience for initiating this search?

Review Comment:
   lets be clear about what has to be in Appendix.. I think. research like this 
has to be in Appendix. anything reader needs to understand to read the design 
and implementation needs to be in background. 



##########
rfc/rfc-102/rfc-102.md:
##########
@@ -0,0 +1,220 @@
+<!--
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to You under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+# RFC-102: Batch Vector Search in Apache Hudi
+
+## Proposers
+
+- @rahil-c
+
+## Approvers
+
+- @vinothchandar
+- @balaji-varadarajan-ai
+- @yihua
+
+## Abstract
+
+This RFC proposes adding the ability to perform a native **vector similarity 
search** on Hudi tables via Spark SQL.
+For the initial phase, we want to allow users to perform an offline 
batch(Spark) vector search that will run a distributed k-nearest neighbors(knn) 
algorithm.
+Note this RFC will not cover **future enhancements** around a vector index to 
speed up this search. These details will be discussed in a future RFC around 
index creation and management: https://github.com/apache/hudi/pull/14255
+
+### What is a Vector Embedding?
+
+A **vector embedding** is a fixed-dimensional numeric representation 
(typically an array of `FLOAT`/`DOUBLE`/`INT`) produced by a model to capture 
semantic properties of an object (text, image, audio, etc.). For example:
+
+- An image encoder maps each image to a `d`-dimensional vector capturing 
visual semantics.
+- A text encoder maps a sentence or document to a `d`-dimensional vector 
capturing its meaning.
+
+This RFC assumes that embeddings are already generated by an encoder/model, 
and are already stored as numeric vectors in a Hudi table using the VECTOR 
logical type https://github.com/apache/hudi/pull/18184/changes, 
+and focuses on how to **search** them given a new embedding.
+
+## Generally does this KNN vector search work for?
+*   Given a **query embedding** (i.e, raw data like an image, text, video that 
has now been transformed by an embedding model ),
+*   Find the **K** most similar rows in a Hudi table,
+*   Return those the rows which have the **smallest similarity score**.
+*   A similarity score is calculated by using a **distance metric** such as 
running one of the following **(cosine, dot product, L2)**
+
+
+#### Visual example using a simple table
+Consider a table that contains an **image/blob column** and a **vector 
embedding column**. For simplicity, we'll use 3-dimensional embeddings (real 
systems typically use dimensions much larger like > 1000 
[https://developers.openai.com/api/docs/guides/embeddings/](https://developers.openai.com/api/docs/guides/embeddings/)):
+
+<img 
src="https://raw.githubusercontent.com/rahil-c/hudi/rahil/rfc102-hudi-vector-search/rfc/rfc-102/embedding_table.png";
 width="600" />
+
+Now suppose the user has a new image of a **cat (`cat3.jpg`)** and uses an 
external model to generate its embedding:
+**- Query embedding for** ``cat3.jpg`: `q = [0.09, 0.19, 0.31]``
+
+<img 
src="https://raw.githubusercontent.com/rahil-c/hudi/rahil/rfc102-hudi-vector-search/rfc/rfc-102/cat_emebdding.png";
 width="600" />
+
+A typical query flow for this example would be:
+1. User calls a vector search function on the table and passes **`q`.**
+2. The function **reads all rows (or a filtered subset)** from the table.
+3. For each row, it extracts `image_embedding` and computes the distance to 
``q` (for now we will use, **L2 Euclidean distance** to keep things simple 
which runs the below for each row ):
+
+In three dimensions, at each index of the vector, the distance is computed by 
running the following
+
+- `dist(q, cat1.jpg) ≈ 0.02`
+- `dist(q, cat2.jpg) ≈ 0.03`
+- `dist(q, dog1.jpg) ≈ 1.21`
+
+<img 
src="https://raw.githubusercontent.com/rahil-c/hudi/rahil/rfc102-hudi-vector-search/rfc/rfc-102/comparison_embedding.png";
 width="600" />
+
+4. It keeps the **K rows with the lowest distance (highest similarity)**. For 
example, for `k = 2` the results might be:
+
+
+### ## What is the typical user experience for initiating this search?
+
+We wanted to follow similar semantics as what other modern data systems offer 
for performing vector search.
+See the following reference material here:
+* 
https://docs.databricks.com/aws/en/sql/language-manual/functions/vector_search
+* https://docs.snowflake.com/en/user-guide/snowflake-cortex/vector-embeddings
+* https://docs.cloud.google.com/bigquery/docs/vector-search
+
+
+## Single query embedding search 
+An initial start for how a hudi vector interface would look would be something 
like this. *Note* In the future we can an index parameter, once we have a 
vector index implementation for now though this has been omitted to avoid 
confusion.
+```
+SELECT *
+FROM hudi_vector_search(
+table name or table_path => 'table' OR 's3://bucket/path/to/hudi/table',
+embedding_col   => 'image_embedding',
+query_vector    => ARRAY(0.12F, -0.03F, 0.81F, ...),
+k               => 10,
+distance_metric => 'cosine'
+)
+);
+```
+Users can then chain other SQL operations on top of this such as performing 
filters and join on the results. Here "products" is the table name and 
"embedding" is the vector embedding column.
+
+```
+    // Vector search with WHERE clause filtering
+      val result = spark.sql(
+        s"""
+           |SELECT id, name, price, category, _distance
+           |FROM hudi_vector_search(
+           |  'products',
+           |  'embedding',
+           |  ARRAY(1.0, 2.0, 3.0),
+           |  10
+           |)
+           |WHERE category = 'electronics' AND price < 100
+           |ORDER BY _distance
+           |""".stripMargin
+      ).collect()
+```
+
+```
+  // Vector search with JOIN
+      val result = spark.sql(
+        s"""
+           |SELECT vs.id, vs.name, c.category_name, vs._distance
+           |FROM hudi_vector_search(
+           |  'products',
+           |  'embedding',
+           |  ARRAY(1.5, 2.5),
+           |  3
+           |) vs
+           |JOIN $categoriesTable c ON vs.category_id = c.category_id
+           |ORDER BY vs._distance
+           |""".stripMargin
+      ).collect()
+
+```
+
+## Batch of query embeddings search
+Say we have two tables that have their own vector embeddings columns(these 
columns share some semantic meaning with one another). The user should be able 
to compare the embeddings from tableA against the embeddings of tableB.
+For each element in tableA, we can return back the nearest closest k elements 
from tableB.

Review Comment:
   nit: write out what tables A and B are in example below?



##########
rfc/rfc-102/rfc-102.md:
##########
@@ -0,0 +1,220 @@
+<!--
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to You under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+# RFC-102: Batch Vector Search in Apache Hudi
+
+## Proposers
+
+- @rahil-c
+
+## Approvers
+
+- @vinothchandar
+- @balaji-varadarajan-ai
+- @yihua
+
+## Abstract
+
+This RFC proposes adding the ability to perform a native **vector similarity 
search** on Hudi tables via Spark SQL.
+For the initial phase, we want to allow users to perform an offline 
batch(Spark) vector search that will run a distributed k-nearest neighbors(knn) 
algorithm.
+Note this RFC will not cover **future enhancements** around a vector index to 
speed up this search. These details will be discussed in a future RFC around 
index creation and management: https://github.com/apache/hudi/pull/14255
+
+### What is a Vector Embedding?

Review Comment:
   Can we call this `Background` .. basically stick to the RFC template?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] docs: RFC-102 - Spark Vector Search in Apache Hudi [hudi]

Reply via email to