This is an automated email from the ASF dual-hosted git repository.

vhs pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
     new 74649c83045d docs: RFC-102 - Spark Vector Search in Apache Hudi 
(#14218)
74649c83045d is described below

commit 74649c83045d801a61b4182b36814460f0eca6ad
Author: Rahil C <[email protected]>
AuthorDate: Fri Mar 13 11:24:07 2026 -0700

    docs: RFC-102 - Spark Vector Search in Apache Hudi (#14218)
    
    * docs: RFC-102 - Native Vector Search in Hudi
    
    * update rfc with more information
    
    * fix paths
    
    * fix paths
    
    * fix paths
    
    * address vc comment
    
    * address PR review comments on RFC-102
    
    - Add "Spark" to RFC title per vinothchandar's request
    - Rewrite abstract to explicitly mention Spark TVF
    - Rename "What is a Vector Embedding?" section to "Background" per RFC 
template
    - Add external links for embedding background reading
    - Add RFC-100 and RFC-99 references with proper links
    - Fix broken heading ("### ##") for user experience section
    - Move comparison references (Databricks, Snowflake, BigQuery) to Appendix
    - Clarify tables A/B in batch search section with actual names 
(users/products)
    - Move Future Enhancements under Appendix with clear RFC-103 cross-reference
    - Add note clarifying TVF will be transparently accelerated by future index
    - Convert all Appendix reference links to descriptive markdown links
    - Add BigQuery vector search overview link
---
 rfc/rfc-102/cat_emebdding.png        | Bin 0 -> 6075633 bytes
 rfc/rfc-102/comparison_embedding.png | Bin 0 -> 6920384 bytes
 rfc/rfc-102/embedding_table.png      | Bin 0 -> 6452822 bytes
 rfc/rfc-102/rfc-102.md               | 227 +++++++++++++++++++++++++++++++++++
 4 files changed, 227 insertions(+)

diff --git a/rfc/rfc-102/cat_emebdding.png b/rfc/rfc-102/cat_emebdding.png
new file mode 100644
index 000000000000..9639e5328f14
Binary files /dev/null and b/rfc/rfc-102/cat_emebdding.png differ
diff --git a/rfc/rfc-102/comparison_embedding.png 
b/rfc/rfc-102/comparison_embedding.png
new file mode 100644
index 000000000000..7423c2be928c
Binary files /dev/null and b/rfc/rfc-102/comparison_embedding.png differ
diff --git a/rfc/rfc-102/embedding_table.png b/rfc/rfc-102/embedding_table.png
new file mode 100644
index 000000000000..65d028b73549
Binary files /dev/null and b/rfc/rfc-102/embedding_table.png differ
diff --git a/rfc/rfc-102/rfc-102.md b/rfc/rfc-102/rfc-102.md
new file mode 100644
index 000000000000..7eb22e0de6e9
--- /dev/null
+++ b/rfc/rfc-102/rfc-102.md
@@ -0,0 +1,227 @@
+<!--
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to You under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+# RFC-102: Spark Batch Vector Search in Apache Hudi
+
+## Proposers
+
+- @rahil-c
+
+## Approvers
+
+- @vinothchandar
+- @balaji-varadarajan-ai
+- @yihua
+
+## Abstract
+
+This RFC adds a **Spark Table-Valued Function (TVF)** to perform efficient 
batch vector similarity searches on Hudi tables.
+
+For the initial phase, we want to allow users to perform an offline batch 
(Spark) vector search that will run a distributed k-nearest neighbors (KNN) 
algorithm.
+Note this RFC will not cover **future enhancements** around a vector index to 
speed up this search. These details will be discussed in a future RFC around 
index creation and management: https://github.com/apache/hudi/pull/14255
+
+## Background
+
+A **vector embedding** is a fixed-dimensional numeric representation 
(typically an array of `FLOAT`/`DOUBLE`/`INT`) produced by a model to capture 
semantic properties of an object (text, image, audio, etc.). For more details 
on embeddings, see:
+- [OpenAI Embeddings Guide](https://platform.openai.com/docs/guides/embeddings)
+- [Google Machine Learning - 
Embeddings](https://developers.google.com/machine-learning/crash-course/embeddings/video-lecture)
+
+For example:
+- An image encoder maps each image to a `d`-dimensional vector capturing 
visual semantics.
+- A text encoder maps a sentence or document to a `d`-dimensional vector 
capturing its meaning.
+
+Building on [RFC-100](https://github.com/apache/hudi/tree/master/rfc/rfc-100) 
(unstructured data storage in Hudi), Hudi tables can contain unstructured data 
columns (e.g., images, video, documents) and traditional structured data 
columns, along with columns that store the related *embeddings* for those 
contents.
+
+This RFC assumes that embeddings are already generated by an encoder/model, 
and are already stored as numeric vectors in a Hudi table using the [VECTOR 
logical type (RFC-99)](https://github.com/apache/hudi/tree/master/rfc/rfc-99),
+and focuses on how to **search** them given a new embedding.
+
+## How does KNN vector search work?
+*   Given a **query embedding** `q` (i.e., raw data such as an image, text, or video that has been transformed by an embedding model),
+*   find the **K** most similar rows in a Hudi table,
+*   and return the rows with the **smallest distance score** (i.e., the most similar rows).
+*   A distance score is calculated using a **distance metric** such as one of **cosine, dot product, or L2**.
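These metrics can be sketched in plain Python (a minimal illustration only; a real implementation would use Spark expressions or vectorized kernels):

```python
import math

def l2_distance(q, v):
    # Euclidean distance: sqrt of the sum of squared per-dimension differences.
    return math.sqrt(sum((qi - vi) ** 2 for qi, vi in zip(q, v)))

def dot_product(q, v):
    # A larger dot product means more similar (a similarity, not a distance).
    return sum(qi * vi for qi, vi in zip(q, v))

def cosine_distance(q, v):
    # 1 - cosine similarity; 0 means the vectors point in the same direction.
    norm_q = math.sqrt(sum(qi ** 2 for qi in q))
    norm_v = math.sqrt(sum(vi ** 2 for vi in v))
    return 1.0 - dot_product(q, v) / (norm_q * norm_v)
```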
+
+
+### Visual example using a simple table
+Consider a table that contains an **image/blob column** and a **vector embedding column**. For simplicity, we'll use 3-dimensional embeddings (real systems typically use much larger dimensions, often > 1000; see [OpenAI embeddings](https://developers.openai.com/api/docs/guides/embeddings/)):
+
+<img src="https://raw.githubusercontent.com/rahil-c/hudi/rahil/rfc102-hudi-vector-search/rfc/rfc-102/embedding_table.png" width="600" />
+
+Now suppose the user has a new image of a **cat (`cat3.jpg`)** and uses an external model to generate its embedding:
+- **Query embedding for `cat3.jpg`:** `q = [0.09, 0.19, 0.31]`
+
+<img src="https://raw.githubusercontent.com/rahil-c/hudi/rahil/rfc102-hudi-vector-search/rfc/rfc-102/cat_emebdding.png" width="600" />
+
+A typical query flow for this example would be:
+1. The user calls a vector search function on the table and passes **`q`**.
+2. The function **reads all rows (or a filtered subset)** from the table.
+3. For each row, it extracts `image_embedding` and computes its distance to `q` (for now we use **L2 Euclidean distance** to keep things simple).
+
+In three dimensions, the L2 distance between `q` and a row embedding `v` is `dist(q, v) = sqrt((q1 - v1)^2 + (q2 - v2)^2 + (q3 - v3)^2)`, giving for example:
+
+- `dist(q, cat1.jpg) ≈ 0.02`
+- `dist(q, cat2.jpg) ≈ 0.03`
+- `dist(q, dog1.jpg) ≈ 1.21`
+
+<img src="https://raw.githubusercontent.com/rahil-c/hudi/rahil/rfc102-hudi-vector-search/rfc/rfc-102/comparison_embedding.png" width="600" />
+
+4. It keeps the **K rows with the lowest distance (highest similarity)**. For example, for `k = 2`, the two cat images (`cat1.jpg` and `cat2.jpg`) would be returned.
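The query flow above can be sketched as a brute-force scan in plain Python (illustrative only; the row embeddings below are made up to roughly reproduce the example distances):

```python
import math

def knn_search(rows, embedding_col, q, k):
    """Brute-force KNN: score every row against q, keep the k closest."""
    scored = []
    for row in rows:
        v = row[embedding_col]
        dist = math.sqrt(sum((qi - vi) ** 2 for qi, vi in zip(q, v)))
        scored.append((dist, row["image"]))
    scored.sort()  # ascending distance = most similar first
    return [name for _, name in scored[:k]]

# Hypothetical rows (embeddings invented for this sketch).
rows = [
    {"image": "cat1.jpg", "image_embedding": [0.10, 0.20, 0.30]},
    {"image": "cat2.jpg", "image_embedding": [0.11, 0.21, 0.33]},
    {"image": "dog1.jpg", "image_embedding": [0.90, 0.80, 0.10]},
]
q = [0.09, 0.19, 0.31]
print(knn_search(rows, "image_embedding", q, k=2))  # ['cat1.jpg', 'cat2.jpg']
```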
+
+
+## User Experience
+
+We follow semantics similar to those that other modern data systems offer for performing vector search (see [Appendix: References](#appendix) for comparison material).
+
+
+### Single query embedding search
+An initial sketch of how a Hudi vector search interface would look is something like this. *Note:* in the future we can add an index parameter once we have a vector index implementation; for now it has been omitted to avoid confusion.
+```
+SELECT *
+FROM hudi_vector_search(
+  table_name or table_path => 'table' OR 's3://bucket/path/to/hudi/table',
+  embedding_col   => 'image_embedding',
+  query_vector    => ARRAY(0.12F, -0.03F, 0.81F, ...),
+  k               => 10,
+  distance_metric => 'cosine'
+);
+```
+Users can then chain other SQL operations on top of this, such as performing filters and joins on the results. In the examples below, `products` is the table name and `embedding` is the vector embedding column.
+
+```
+// Vector search with WHERE clause filtering
+val result = spark.sql(
+  """
+    |SELECT id, name, price, category, _distance
+    |FROM hudi_vector_search(
+    |  'products',
+    |  'embedding',
+    |  ARRAY(1.0, 2.0, 3.0),
+    |  10
+    |)
+    |WHERE category = 'electronics' AND price < 100
+    |ORDER BY _distance
+    |""".stripMargin
+).collect()
+```
+
+```
+// Vector search with JOIN
+val result = spark.sql(
+  s"""
+    |SELECT vs.id, vs.name, c.category_name, vs._distance
+    |FROM hudi_vector_search(
+    |  'products',
+    |  'embedding',
+    |  ARRAY(1.5, 2.5),
+    |  3
+    |) vs
+    |JOIN $categoriesTable c ON vs.category_id = c.category_id
+    |ORDER BY vs._distance
+    |""".stripMargin
+).collect()
+```
+
+### Batch of query embeddings search
+Say we have two tables, each with its own vector embedding column (these columns share some semantic meaning with one another). For example, a **users** table with user preference embeddings and a **products** table with product embeddings. The user should be able to compare the embeddings from the users table against the embeddings of the products table.
+For each row in the users table, we return the k nearest elements from the products table.
+```
+SELECT * FROM hudi_vector_search(
+  base_table      => 'users',
+  base_col        => 'user_preference_embedding',
+  query_table     => 'products',
+  query_col       => 'product_embedding',
+  k               => 2,
+  distance_metric => 'l2'
+);
+```
+We would return the two nearest matches from the products table for each user (for now, only two users are shown instead of all the rows in the table):
+```
++----------+---------------------------+-------------+-------------------+----------+
+| users.id | user_preference_embedding | products.id | product_embedding | distance |
++----------+---------------------------+-------------+-------------------+----------+
+| user_01  | [0.8, 0.2]                | laptop_99   | [0.85, 0.15]      | 0.0707   |
+| user_01  | [0.8, 0.2]                | mouse_42    | [0.70, 0.40]      | 0.2236   |
+| user_02  | [-0.5, 0.9]               | book_11     | [-0.60, 0.80]     | 0.1414   |
+| user_02  | [-0.5, 0.9]               | kindle_88   | [-0.30, 0.95]     | 0.2062   |
++----------+---------------------------+-------------+-------------------+----------+
+...
+```
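The batch search can be sketched in plain Python (a minimal illustration, not the proposed implementation; it reproduces the toy embeddings and L2 distances shown above):

```python
import math

def l2(q, v):
    return math.sqrt(sum((qi - vi) ** 2 for qi, vi in zip(q, v)))

def batch_knn(queries, corpus, k):
    """For each query row, return the k nearest corpus rows with distances."""
    results = {}
    for qid, q in queries.items():
        scored = sorted((l2(q, v), cid) for cid, v in corpus.items())
        results[qid] = [(cid, round(d, 4)) for d, cid in scored[:k]]
    return results

users = {"user_01": [0.8, 0.2], "user_02": [-0.5, 0.9]}
products = {
    "laptop_99": [0.85, 0.15],
    "mouse_42":  [0.70, 0.40],
    "book_11":   [-0.60, 0.80],
    "kindle_88": [-0.30, 0.95],
}
print(batch_knn(users, products, k=2))
# {'user_01': [('laptop_99', 0.0707), ('mouse_42', 0.2236)],
#  'user_02': [('book_11', 0.1414), ('kindle_88', 0.2062)]}
```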
+
+## How would distributed KNN search work?
+
+1. The lifecycle of a distributed KNN query begins at the coordinator/driver node, which receives the query vector embedding `q` and the desired number of neighbors `k`. The driver broadcasts `q` to all participating nodes/executors.
+
+2. Upon receiving `q`, each executor performs a linear scan of its assigned slice of the data. To maintain performance, executors do not sort all local results. Instead, they use a local priority queue, specifically a max-heap of size `k`, to track the closest vectors encountered so far.
+
+3. As the executor iterates through its slice of the data, it calculates each row's distance to `q`. If the calculated distance is smaller than the current maximum distance in the heap (the `k`-th neighbor), the new vector replaces the previous `k`-th neighbor and the heap is rebalanced. This ensures that the memory overhead on the executor remains constant regardless of the partition size, as only `k` candidates are stored.
+
+4. Once all executors complete their local scans, they transmit their local top-`k` heaps back to the coordinator. The driver then performs a final global merge: if the cluster consists of `m` executors, the coordinator receives `m × k` candidate vectors and merges these sets into a final global top-`k` list.
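Steps 1-4 can be sketched in plain Python (a minimal, single-process illustration of the executor-side heap and driver-side merge; Python's `heapq` is a min-heap, so distances are negated to simulate a max-heap):

```python
import heapq
import math

def local_top_k(rows, q, k):
    """Executor side: one pass over a data slice, keeping only k candidates.

    heapq is a min-heap, so we store negated distances: the heap root is
    then the *largest* distance among the current k candidates.
    """
    heap = []  # entries are (-distance, row_id)
    for row_id, v in rows:
        dist = math.sqrt(sum((qi - vi) ** 2 for qi, vi in zip(q, v)))
        if len(heap) < k:
            heapq.heappush(heap, (-dist, row_id))
        elif dist < -heap[0][0]:  # closer than the current k-th neighbor
            heapq.heapreplace(heap, (-dist, row_id))
    return [(-d, rid) for d, rid in heap]

def global_merge(partials, k):
    """Driver side: merge the m executors' local top-k lists into a global top-k."""
    return sorted(d for partial in partials for d in partial)[:k]

# Two hypothetical executor slices of the table.
q = [0.0, 0.0]
slice_a = [("a1", [0.1, 0.1]), ("a2", [2.0, 2.0]), ("a3", [0.3, 0.0])]
slice_b = [("b1", [0.2, 0.0]), ("b2", [5.0, 5.0])]
partials = [local_top_k(s, q, k=2) for s in (slice_a, slice_b)]
print(global_merge(partials, k=2))
```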
+
+
+
+## How can we improve brute force KNN to perform optimally within Spark/Hudi?
+
+The main question then becomes: what tricks can we use to parallelize this work as efficiently as possible, such as having each worker node (executor) take N rows from the Hudi table and perform the distance calculation on that file group?
+We also want to make sure that before we run the above, we do the following:
+* Ensure Spark projects only the vector column provided by the user in the SQL, instead of reading all columns from the table.
+* Ensure filters are applied before the vector computation to cut the number of rows; otherwise we need to compare every row in the table against the query embedding.
+* Ensure each worker node is given a table slice (in this case a file group); from the file group we get the latest file slice, read all rows of the underlying base parquet file (applying projection and filters first), and perform the distance calculation against the query row.
+
+## Appendix
+
+### Future Enhancements
+
+The next goal would be to integrate vector search into Hudi's indexing and 
metadata capabilities, similar to how Hudi already uses the metadata table and 
secondary indexes (e.g., Bloom, column stats) to accelerate queries.
+This RFC intentionally defers detailed index design to a dedicated 
vector-index RFC ([RFC-103](https://github.com/apache/hudi/pull/14255)), but 
discusses some ideas at a high level.
+
+Note: The current RFC focuses solely on the brute-force distributed KNN 
approach via a Spark TVF. The TVF provides a clean user-facing API that can be 
transparently accelerated by a vector index in the future without changing the 
query interface.
+
+#### Vector Index Algorithms
+
+We anticipate supporting one or more of the following index families:
+
+##### IVF-Flat / IVF-PQ
+Partition vectors into coarse clusters, then search only a subset of clusters 
for queries instead of doing a full table scan.
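A minimal sketch of the IVF-Flat idea in plain Python (illustrative only; the centroids here are hand-picked, whereas real implementations learn them via k-means and tune the number of probed clusters, `nprobe`):

```python
import math

def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def build_ivf(vectors, centroids):
    """Assign each vector to its nearest coarse centroid (the inverted lists)."""
    lists = {i: [] for i in range(len(centroids))}
    for vid, v in vectors:
        nearest = min(range(len(centroids)), key=lambda i: l2(v, centroids[i]))
        lists[nearest].append((vid, v))
    return lists

def ivf_search(lists, centroids, q, k, nprobe):
    """Scan only the nprobe clusters whose centroids are closest to q."""
    probe = sorted(range(len(centroids)), key=lambda i: l2(q, centroids[i]))[:nprobe]
    candidates = [(l2(q, v), vid) for i in probe for vid, v in lists[i]]
    return sorted(candidates)[:k]

# Toy example: two hand-picked centroids, four vectors.
centroids = [[0.0, 0.0], [10.0, 10.0]]
vectors = [("v1", [0.1, 0.2]), ("v2", [0.3, 0.1]),
           ("v3", [9.8, 10.1]), ("v4", [10.2, 9.9])]
lists = build_ivf(vectors, centroids)
print(ivf_search(lists, centroids, q=[0.0, 0.1], k=1, nprobe=1))
```

With `nprobe=1` only the first cluster's two vectors are scanned instead of all four, which is the source of the speedup over a full table scan.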
+
+##### HNSW (Hierarchical Navigable Small World)
+Graph-based approximate k-NN with good recall/latency trade-offs.
+
+Possible integration approaches:
+
+1. Metadata-table backed vector index: store index structures (coarse centroids, graph layers, posting lists, etc.) as specialized index tables within Hudi's metadata table.
+
+### References
+* [RFC-99: VECTOR Logical 
Type](https://github.com/apache/hudi/tree/master/rfc/rfc-99)
+* [Databricks - 
vector_search](https://docs.databricks.com/aws/en/sql/language-manual/functions/vector_search)
+* [Snowflake - Vector 
Embeddings](https://docs.snowflake.com/en/user-guide/snowflake-cortex/vector-embeddings)
+* [BigQuery - 
VECTOR_SEARCH](https://docs.cloud.google.com/bigquery/docs/reference/standard-sql/search_functions#vector_search)
+* [BigQuery - Vector Search 
Overview](https://docs.cloud.google.com/bigquery/docs/vector-search)
+* [pgvector](https://github.com/pgvector/pgvector)
+* [SurrealDB - Vector 
Models](https://surrealdb.com/docs/surrealdb/models/vector)
+* [Milvus - Distributed Vector 
Search](https://milvus.io/ai-quick-reference/in-a-distributed-vector-database-how-is-the-search-query-executed-across-multiple-machines-and-how-are-partial-results-merged-to-produce-the-final-nearest-neighbors-list)
