This is an automated email from the ASF dual-hosted git repository.
vhs pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git
The following commit(s) were added to refs/heads/master by this push:
new 74649c83045d docs: RFC-102 - Spark Vector Search in Apache Hudi
(#14218)
74649c83045d is described below
commit 74649c83045d801a61b4182b36814460f0eca6ad
Author: Rahil C <[email protected]>
AuthorDate: Fri Mar 13 11:24:07 2026 -0700
docs: RFC-102 - Spark Vector Search in Apache Hudi (#14218)
* docs: RFC-102 - Native Vector Search in Hudi
* update rfc with more information
* fix paths
* fix paths
* fix paths
* address vc comment
* address PR review comments on RFC-102
- Add "Spark" to RFC title per vinothchandar's request
- Rewrite abstract to explicitly mention Spark TVF
- Rename "What is a Vector Embedding?" section to "Background" per RFC
template
- Add external links for embedding background reading
- Add RFC-100 and RFC-99 references with proper links
- Fix broken heading ("### ##") for user experience section
- Move comparison references (Databricks, Snowflake, BigQuery) to Appendix
- Clarify tables A/B in batch search section with actual names
(users/products)
- Move Future Enhancements under Appendix with clear RFC-103 cross-reference
- Add note clarifying TVF will be transparently accelerated by future index
- Convert all Appendix reference links to descriptive markdown links
- Add BigQuery vector search overview link
---
rfc/rfc-102/cat_emebdding.png | Bin 0 -> 6075633 bytes
rfc/rfc-102/comparison_embedding.png | Bin 0 -> 6920384 bytes
rfc/rfc-102/embedding_table.png | Bin 0 -> 6452822 bytes
rfc/rfc-102/rfc-102.md | 227 +++++++++++++++++++++++++++++++++++
4 files changed, 227 insertions(+)
diff --git a/rfc/rfc-102/cat_emebdding.png b/rfc/rfc-102/cat_emebdding.png
new file mode 100644
index 000000000000..9639e5328f14
Binary files /dev/null and b/rfc/rfc-102/cat_emebdding.png differ
diff --git a/rfc/rfc-102/comparison_embedding.png b/rfc/rfc-102/comparison_embedding.png
new file mode 100644
index 000000000000..7423c2be928c
Binary files /dev/null and b/rfc/rfc-102/comparison_embedding.png differ
diff --git a/rfc/rfc-102/embedding_table.png b/rfc/rfc-102/embedding_table.png
new file mode 100644
index 000000000000..65d028b73549
Binary files /dev/null and b/rfc/rfc-102/embedding_table.png differ
diff --git a/rfc/rfc-102/rfc-102.md b/rfc/rfc-102/rfc-102.md
new file mode 100644
index 000000000000..7eb22e0de6e9
--- /dev/null
+++ b/rfc/rfc-102/rfc-102.md
@@ -0,0 +1,227 @@
+<!--
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to You under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+# RFC-102: Spark Batch Vector Search in Apache Hudi
+
+## Proposers
+
+- @rahil-c
+
+## Approvers
+
+- @vinothchandar
+- @balaji-varadarajan-ai
+- @yihua
+
+## Abstract
+
+This RFC adds a **Spark Table-Valued Function (TVF)** to perform efficient
batch vector similarity searches on Hudi tables.
+
+For the initial phase, we want to allow users to perform an offline batch (Spark) vector search that runs a distributed k-nearest neighbors (KNN) algorithm.
+Note that this RFC does not cover **future enhancements** around a vector index to speed up this search. Those details will be discussed in a future RFC on index creation and management: https://github.com/apache/hudi/pull/14255
+
+## Background
+
+A **vector embedding** is a fixed-dimensional numeric representation
(typically an array of `FLOAT`/`DOUBLE`/`INT`) produced by a model to capture
semantic properties of an object (text, image, audio, etc.). For more details
on embeddings, see:
+- [OpenAI Embeddings Guide](https://platform.openai.com/docs/guides/embeddings)
+- [Google Machine Learning -
Embeddings](https://developers.google.com/machine-learning/crash-course/embeddings/video-lecture)
+
+For example:
+- An image encoder maps each image to a `d`-dimensional vector capturing
visual semantics.
+- A text encoder maps a sentence or document to a `d`-dimensional vector
capturing its meaning.
+
+Building on [RFC-100](https://github.com/apache/hudi/tree/master/rfc/rfc-100)
(unstructured data storage in Hudi), Hudi tables can contain unstructured data
columns (e.g., images, video, documents) and traditional structured data
columns, along with columns that store the related *embeddings* for those
contents.
+
+This RFC assumes that embeddings are already generated by an encoder/model,
and are already stored as numeric vectors in a Hudi table using the [VECTOR
logical type (RFC-99)](https://github.com/apache/hudi/tree/master/rfc/rfc-99),
+and focuses on how to **search** them given a new embedding.
+
+## How does KNN vector search work?
+* Given a **query embedding** `q` (i.e., raw data such as an image, text, or video that has been transformed by an embedding model),
+* Find the **K** most similar rows in a Hudi table,
+* Return the rows with the **smallest distance** (i.e., the highest similarity) to `q`.
+* The distance is calculated using a **distance metric** such as **cosine, dot product, or L2**.
+
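The distance metrics listed above can be sketched in plain Python (an illustrative sketch only, not part of the proposed Hudi code; the function names are hypothetical):

```python
import math

def dot(a, b):
    # Dot product: higher means more similar (for normalized vectors).
    return sum(x * y for x, y in zip(a, b))

def l2(a, b):
    # L2 (Euclidean) distance: lower means more similar.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    # Cosine distance = 1 - cosine similarity; lower means more similar.
    norm_a = math.sqrt(dot(a, a))
    norm_b = math.sqrt(dot(b, b))
    return 1.0 - dot(a, b) / (norm_a * norm_b)

q = [0.09, 0.19, 0.31]   # query embedding used in the visual example below
v = [0.10, 0.20, 0.30]   # a hypothetical row embedding
print(l2(q, v))          # a small distance: the vectors are similar
```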
+
+#### Visual example using a simple table
+Consider a table that contains an **image/blob column** and a **vector embedding column**. For simplicity, we'll use 3-dimensional embeddings (real systems typically use much higher dimensionality, often over 1,000; see the [OpenAI embeddings guide](https://developers.openai.com/api/docs/guides/embeddings/)):
+
+<img
src="https://raw.githubusercontent.com/rahil-c/hudi/rahil/rfc102-hudi-vector-search/rfc/rfc-102/embedding_table.png"
width="600" />
+
+Now suppose the user has a new image of a **cat (`cat3.jpg`)** and uses an
external model to generate its embedding:
+- **Query embedding for `cat3.jpg`:** `q = [0.09, 0.19, 0.31]`
+
+<img
src="https://raw.githubusercontent.com/rahil-c/hudi/rahil/rfc102-hudi-vector-search/rfc/rfc-102/cat_emebdding.png"
width="600" />
+
+A typical query flow for this example would be:
+1. User calls a vector search function on the table and passes **`q`**.
+2. The function **reads all rows (or a filtered subset)** from the table.
+3. For each row, it extracts `image_embedding` and computes the distance to `q` (for now we use **L2 Euclidean distance** to keep things simple, running the computation below for each row):
+
+In three dimensions, the L2 distance between `q` and a row embedding `v` is computed as `sqrt((q1 - v1)^2 + (q2 - v2)^2 + (q3 - v3)^2)`, which gives:
+
+- `dist(q, cat1.jpg) ≈ 0.02`
+- `dist(q, cat2.jpg) ≈ 0.03`
+- `dist(q, dog1.jpg) ≈ 1.21`
+
+<img
src="https://raw.githubusercontent.com/rahil-c/hudi/rahil/rfc102-hudi-vector-search/rfc/rfc-102/comparison_embedding.png"
width="600" />
+
+4. It keeps the **K rows with the lowest distance (highest similarity)**. For example, for `k = 2` the results would be `cat1.jpg` and `cat2.jpg`.
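Step 4's top-k selection over these distances can be sketched as follows (illustrative Python, not the proposed implementation):

```python
import heapq

# (row, distance) pairs from the worked example above
distances = [("cat1.jpg", 0.02), ("cat2.jpg", 0.03), ("dog1.jpg", 1.21)]

k = 2
# Keep the k rows with the smallest distance (i.e., the highest similarity).
top_k = heapq.nsmallest(k, distances, key=lambda pair: pair[1])
print(top_k)  # the two cat images are returned; the dog is filtered out
```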
+
+
+## User Experience
+
+We follow semantics similar to what other modern data systems offer for performing vector search (see [Appendix: References](#appendix) for comparison material).
+
+
+## Single query embedding search
+An initial sketch of a Hudi vector search interface would look something like this. *Note:* in the future we can add an index parameter once we have a vector index implementation; for now it has been omitted to avoid confusion.
+```
+SELECT *
+FROM hudi_vector_search(
+  -- either a table name or a table path may be passed
+  table_name_or_path => 'table' /* or 's3://bucket/path/to/hudi/table' */,
+  embedding_col => 'image_embedding',
+  query_vector => ARRAY(0.12F, -0.03F, 0.81F, ...),
+  k => 10,
+  distance_metric => 'cosine'
+);
+```
+Users can then chain other SQL operations on top of this, such as filters and joins on the results. In the examples below, "products" is the table name and "embedding" is the vector embedding column.
+
+```
+ // Vector search with WHERE clause filtering
+ val result = spark.sql(
+ s"""
+ |SELECT id, name, price, category, _distance
+ |FROM hudi_vector_search(
+ | 'products',
+ | 'embedding',
+ | ARRAY(1.0, 2.0, 3.0),
+ | 10
+ |)
+ |WHERE category = 'electronics' AND price < 100
+ |ORDER BY _distance
+ |""".stripMargin
+ ).collect()
+```
+
+```
+ // Vector search with JOIN
+ val result = spark.sql(
+ s"""
+ |SELECT vs.id, vs.name, c.category_name, vs._distance
+ |FROM hudi_vector_search(
+ | 'products',
+ | 'embedding',
+ | ARRAY(1.5, 2.5),
+ | 3
+ |) vs
+ |JOIN $categoriesTable c ON vs.category_id = c.category_id
+ |ORDER BY vs._distance
+ |""".stripMargin
+ ).collect()
+
+```
+
+## Batch of query embeddings search
+Say we have two tables that each have their own vector embedding column (these columns share some semantic meaning with one another). For example, a **users** table with user preference embeddings and a **products** table with product embeddings. The user should be able to compare the embeddings from the users table against the embeddings of the products table.
+For each row in the users table, we return the k nearest elements from the products table.
+```
+SELECT * FROM hudi_vector_search(
+ base_table => 'users',
+ base_col => 'user_preference_embedding',
+ query_table => 'products',
+ query_col => 'product_embedding',
+ k => 2,
+ distance_metric => 'cosine'
+);
+```
+We would return the two nearest matches from the products table for each user (for now only showing two users instead of all rows in the table):
+```
++----------+---------------------------+-------------+-------------------+--------------------+
+| users.id | user_preference_embedding | products.id | product_embedding |
distance |
++----------+---------------------------+-------------+-------------------+--------------------+
+| user_01 | 0.8 | laptop_99 | 0.85 |
0.0707 |
+| | 0.2 | | 0.15 |
|
++----------+---------------------------+-------------+-------------------+--------------------+
+| user_01 | 0.8 | mouse_42 | 0.70 |
0.2236 |
+| | 0.2 | | 0.40 |
|
++----------+---------------------------+-------------+-------------------+--------------------+
+| user_02 | -0.5 | book_11 | -0.60 |
0.1414 |
+| | 0.9 | | 0.80 |
|
++----------+---------------------------+-------------+-------------------+--------------------+
+| user_02 | -0.5 | kindle_88 | -0.30 |
0.2062 |
+| | 0.9 | | 0.95 |
|
++----------+---------------------------+-------------+-------------------+--------------------+
+...
+...
+...
+```
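As an illustration only (not Hudi code), the semantics of this batch search can be sketched in plain Python. The embedding and distance values are taken from the result table above, and L2 (Euclidean) distance is used here because it reproduces the tabulated distances:

```python
import math

users = {"user_01": [0.8, 0.2], "user_02": [-0.5, 0.9]}
products = {
    "laptop_99": [0.85, 0.15],
    "mouse_42":  [0.70, 0.40],
    "book_11":   [-0.60, 0.80],
    "kindle_88": [-0.30, 0.95],
}

def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

k = 2
results = []
for user_id, u in users.items():
    # For each user row, rank every product by distance and keep the k nearest.
    nearest = sorted(products, key=lambda p: l2(u, products[p]))[:k]
    results += [(user_id, p, round(l2(u, products[p]), 4)) for p in nearest]

for row in results:
    print(row)
```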
+
+## How would distributed KNN search work?
+
+1. The lifecycle of a distributed KNN query begins at the coordinator/driver node, which receives the query vector embedding `q` and the desired number of neighbors `k`. The driver broadcasts `q` to all participating nodes/executors.
+
+2. Upon receiving `q`, each executor performs a linear scan of its received slice of the data. To maintain performance, executors do not sort all local results. Instead, they use a local priority queue (specifically a max-heap of size k) to track the closest vectors encountered during their scan.
+
+3. As the executor iterates through its slice of the data, it calculates the
distance to q. If the calculated distance is smaller than the current maximum
distance in the heap (the k-th neighbor), the new vector replaces the previous
k-th neighbor, and the heap is re-balanced. This ensures that the memory
overhead on the executor remains constant regardless of the partition size, as
only k candidates are stored.
+
+4. Once all executors complete their local scans, they transmit their local
top-k heaps back to the coordinator. The driver is then tasked with a final
global merge. If the cluster consists of m nodes, the coordinator receives m×k
candidate vectors. The coordinator merges these sets into a final global top-k
list.
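The four steps above can be sketched in plain Python, with lists standing in for executor partitions (an illustration under those assumptions, not the proposed Spark implementation):

```python
import heapq
import math

def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def local_top_k(partition, q, k):
    # Steps 2/3: each executor keeps a max-heap of size k over its slice.
    # heapq is a min-heap, so we store negated distances to evict the worst.
    heap = []
    for row_id, vec in partition:
        d = l2(q, vec)
        if len(heap) < k:
            heapq.heappush(heap, (-d, row_id))
        elif -heap[0][0] > d:
            # New candidate beats the current k-th neighbor; replace it.
            heapq.heapreplace(heap, (-d, row_id))
    return [(-neg_d, row_id) for neg_d, row_id in heap]

def distributed_knn(partitions, q, k):
    # Step 1: "broadcast" q to every partition; step 4: the driver merges
    # the m*k candidates into the final global top-k list.
    candidates = []
    for partition in partitions:
        candidates.extend(local_top_k(partition, q, k))
    return heapq.nsmallest(k, candidates)

# Two hypothetical executor slices of a table.
partitions = [
    [("a", [0.0, 0.0]), ("b", [1.0, 1.0])],
    [("c", [0.1, 0.1]), ("d", [5.0, 5.0])],
]
print(distributed_knn(partitions, q=[0.0, 0.0], k=2))
```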
+
+
+
+## How can we improve brute force KNN to perform optimally within Spark/Hudi?
+
+The main question then becomes what optimizations we can apply to parallelize this work as efficiently as possible, such as having each worker node (executor) take N rows from the Hudi table and perform the distance calculation on that file group.
+We also want to make sure, before we run the above, that we do the following:
+* Ensure Spark projects only the vector column provided by the user in the SQL, instead of reading all columns from the table.
+* Ensure filters are applied before the vector computation to cut down the number of rows; otherwise we need to compare every row in the table against the query embedding.
+* Ensure each worker node is given a table slice (in this case a file group); from the file group we will get the latest file slice, read all rows of the underlying base parquet file (applying projection and filters first), and perform the distance calculation against the query row.
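The filter-before-compute and projection points can be illustrated with a small Python sketch (hypothetical row data; in the real implementation Spark and Parquet would handle column projection and predicate pushdown):

```python
import math

# Hypothetical rows; in practice these come from the Hudi table's base files.
rows = [
    {"id": 1, "category": "electronics", "embedding": [0.2, 0.1]},
    {"id": 2, "category": "books",       "embedding": [0.9, 0.8]},
    {"id": 3, "category": "electronics", "embedding": [0.05, 0.1]},
]

def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

q = [0.0, 0.0]

# Apply the user's filters first, so distances are computed only for the
# surviving rows instead of every row in the table.
filtered = [r for r in rows if r["category"] == "electronics"]

# Project only the columns needed for scoring, then compute distances.
scored = sorted(((r["id"], l2(q, r["embedding"])) for r in filtered),
                key=lambda t: t[1])
print(scored)
```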
+
+## Appendix
+
+### Future Enhancements
+
+The next goal would be to integrate vector search into Hudi's indexing and
metadata capabilities, similar to how Hudi already uses the metadata table and
secondary indexes (e.g., Bloom, column stats) to accelerate queries.
+This RFC intentionally defers detailed index design to a dedicated
vector-index RFC ([RFC-103](https://github.com/apache/hudi/pull/14255)), but
discusses some ideas at a high level.
+
+Note: The current RFC focuses solely on the brute-force distributed KNN
approach via a Spark TVF. The TVF provides a clean user-facing API that can be
transparently accelerated by a vector index in the future without changing the
query interface.
+
+#### Vector Index Algorithms
+
+We anticipate supporting one or more of the following index families:
+
+##### IVF-Flat / IVF-PQ
+Partition vectors into coarse clusters, then search only a subset of clusters
for queries instead of doing a full table scan.
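As a rough illustration of the IVF idea (not the proposed index format; centroid and vector values are hypothetical), vectors are assigned to their nearest coarse centroid at build time, and a query probes only the closest cluster(s):

```python
import math

def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical coarse centroids (in practice learned, e.g. via k-means).
centroids = {0: [0.0, 0.0], 1: [10.0, 10.0]}

def assign(vec):
    # Index build: each vector goes into the posting list of its nearest centroid.
    return min(centroids, key=lambda c: l2(vec, centroids[c]))

vectors = {"a": [0.1, 0.2], "b": [9.5, 10.1], "c": [0.3, 0.1]}
posting_lists = {c: [] for c in centroids}
for row_id, vec in vectors.items():
    posting_lists[assign(vec)].append(row_id)

def ivf_search(q, k, nprobe=1):
    # Query: scan only the nprobe nearest clusters instead of the full table.
    probed = sorted(centroids, key=lambda c: l2(q, centroids[c]))[:nprobe]
    candidates = [rid for c in probed for rid in posting_lists[c]]
    return sorted(candidates, key=lambda rid: l2(q, vectors[rid]))[:k]

print(ivf_search([0.0, 0.0], k=2))  # only cluster 0 is scanned
```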
+
+##### HNSW (Hierarchical Navigable Small World)
+Graph-based approximate k-NN with good recall/latency trade-offs.
+
+Possible integration approaches:
+
+1. Metadata-table backed vector index
+2. Store index structures (coarse centroids, graph layers, posting lists,
etc.) as specialized index tables within Hudi's metadata table.
+
+### References
+* [RFC-99: VECTOR Logical
Type](https://github.com/apache/hudi/tree/master/rfc/rfc-99)
+* [Databricks -
vector_search](https://docs.databricks.com/aws/en/sql/language-manual/functions/vector_search)
+* [Snowflake - Vector
Embeddings](https://docs.snowflake.com/en/user-guide/snowflake-cortex/vector-embeddings)
+* [BigQuery -
VECTOR_SEARCH](https://docs.cloud.google.com/bigquery/docs/reference/standard-sql/search_functions#vector_search)
+* [BigQuery - Vector Search
Overview](https://docs.cloud.google.com/bigquery/docs/vector-search)
+* [pgvector](https://github.com/pgvector/pgvector)
+* [SurrealDB - Vector
Models](https://surrealdb.com/docs/surrealdb/models/vector)
+* [Milvus - Distributed Vector
Search](https://milvus.io/ai-quick-reference/in-a-distributed-vector-database-how-is-the-search-query-executed-across-multiple-machines-and-how-are-partial-results-merged-to-produce-the-final-nearest-neighbors-list)