[I] Add Support for Vector Indexing in Apache Hudi [hudi]

via GitHub Sun, 16 Nov 2025 22:02:50 -0800


suryaprasanna opened a new issue, #14290:
URL: https://github.com/apache/hudi/issues/14290


   ### Feature Description
   
   **What the feature achieves:**
   This feature introduces native support for creating, storing, and 
maintaining **vector indexes** within Hudi’s metadata table. It enables Hudi to 
index high-dimensional embedding vectors (e.g., product embeddings, user 
embeddings, text/image embeddings) and efficiently serve **Approximate Nearest 
Neighbor (ANN) search** over these vectors directly from Hudi-managed datasets.
   
   The feature provides:
   - Automatic vector index creation during metadata initialization
   - Horizontal scalability via clustering of embeddings into multiple indexes
   - Versioned vector index management using compaction
   - Consistent handling of inserts, updates, and deletes for embedding columns
   - File-group–aware storage so ingestion and compaction can scale 
independently
   - Native routing of ANN search queries to the latest index version
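
To make the clustering and routing items above concrete, here is a minimal conceptual sketch in plain Python. It is not Hudi code and does not reflect the design in the RFC: the function names (`build_partitions`, `search`), the use of k-means, and the `n_probe` parameter are all assumptions chosen for illustration. It shows one common way to split embeddings into several smaller indexes and send a query only to the closest ones.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_centroid(vector, centroids):
    """Index of the centroid closest to the vector."""
    return min(range(len(centroids)),
               key=lambda i: squared_distance(vector, centroids[i]))


def assign(vectors, centroids):
    """Group record keys by nearest centroid: one dict per partition."""
    partitions = [{} for _ in centroids]
    for key, vec in vectors.items():
        partitions[nearest_centroid(vec, centroids)][key] = vec
    return partitions


def build_partitions(vectors, k, iterations=10, seed=0):
    """Plain k-means (an assumption, not the RFC's algorithm).

    Returns (centroids, partitions); each partition stands in for one
    independently stored index.
    """
    rng = random.Random(seed)
    centroids = [list(vectors[key]) for key in rng.sample(sorted(vectors), k)]
    for _ in range(iterations):
        for i, members in enumerate(assign(vectors, centroids)):
            if members:  # keep the old centroid if a partition is empty
                centroids[i] = [sum(col) / len(members)
                                for col in zip(*members.values())]
    return centroids, assign(vectors, centroids)


def search(query, centroids, partitions, top_k=3, n_probe=1):
    """Approximate search: scan only the n_probe closest partitions."""
    order = sorted(range(len(centroids)),
                   key=lambda i: squared_distance(query, centroids[i]))
    candidates = [(squared_distance(query, vec), key)
                  for i in order[:n_probe]
                  for key, vec in partitions[i].items()]
    return [key for _, key in sorted(candidates)[:top_k]]


# Hypothetical record keys and 2-D embeddings forming two obvious groups.
vectors = {
    "a1": [0.0, 0.1], "a2": [0.1, 0.0], "a3": [0.2, 0.1],
    "b1": [5.0, 5.1], "b2": [5.1, 5.0], "b3": [4.9, 5.2],
}
centroids, partitions = build_partitions(vectors, k=2)
print(search([5.0, 5.0], centroids, partitions, top_k=2))
```

Raising `n_probe` trades speed for recall, because more partitions are scanned per query. A real index would replace the brute-force scan inside each partition with an ANN structure.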
   
   Overall, the feature gives Hudi first-class support for vector search 
infrastructure at data-lake scale, fully integrated with its timeline, metadata 
table, and write operations.
   
   **Why this feature is needed:**
   Modern machine learning and AI workloads rely heavily on embeddings 
generated from models such as BERT, CLIP, multimodal LLMs, recommendation 
models, etc. These embeddings must be stored and queried efficiently to support 
use cases like:
   
   - Semantic search
   - Recommendation systems
   - Similarity-based retrieval
   - RAG (Retrieval-Augmented Generation) pipelines
   
   Today, users must build and maintain external vector databases alongside 
Hudi datasets, which introduces:
   
   - Higher cost due to vertical scaling
   - Separate ingestion pipelines
   - Risk of inconsistency between main data and vector data
   - Extra infrastructure that must be scaled, monitored, and versioned
   - Data duplication
   
   By integrating vector index capabilities directly into Hudi:
   - Vector data remains consistent with the main table
   - ANN indexes are versioned and transactionally managed
   - Updates/deletes to vectors naturally follow the Hudi timeline
   - Index refresh occurs via compaction, leveraging existing orchestration
   - Users avoid maintaining a separate vector database system
   - Large-scale embeddings can be managed using Hudi’s file-group architecture
   
   This brings vector search capabilities into the data lake itself, 
eliminating the need for external systems and enabling end-to-end 
machine-learning data pipelines within Hudi.
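
The versioning and compaction behavior described above can be pictured with a toy model. The sketch below is an assumption-laden illustration in plain Python, not Hudi's implementation or API: the class `VersionedVectorIndex`, its methods, and the choice to serve queries only from the latest compacted version are all hypothetical. It shows upserts and deletes accumulating in a log, compaction folding them into a new index version, and searches being routed to the newest version.

```python
from dataclasses import dataclass, field


@dataclass
class VersionedVectorIndex:
    """Toy model: immutable index versions plus a log of uncompacted changes."""
    versions: dict = field(default_factory=dict)  # commit time -> {key: vector}
    pending: list = field(default_factory=list)   # (op, key, vector or None)

    def upsert(self, key, vector):
        self.pending.append(("upsert", key, list(vector)))

    def delete(self, key):
        self.pending.append(("delete", key, None))

    def latest_version(self):
        """Newest commit time, or None before the first compaction."""
        return max(self.versions) if self.versions else None

    def compact(self, commit_time):
        """Fold pending changes into a new version; older versions stay readable."""
        latest = self.latest_version()
        if latest is not None and commit_time <= latest:
            raise ValueError("commit_time must be newer than the latest version")
        snapshot = dict(self.versions[latest]) if latest is not None else {}
        for op, key, vector in self.pending:
            if op == "upsert":
                snapshot[key] = vector
            else:
                snapshot.pop(key, None)
        self.versions[commit_time] = snapshot
        self.pending = []

    def search(self, query, top_k=1):
        """Brute-force nearest neighbors over the latest compacted version."""
        latest = self.latest_version()
        if latest is None:
            return []

        def dist(vec):
            return sum((a - b) ** 2 for a, b in zip(query, vec))

        ranked = sorted(self.versions[latest].items(),
                        key=lambda kv: (dist(kv[1]), kv[0]))
        return [key for key, _ in ranked[:top_k]]


# Hypothetical record keys and sortable commit-time strings.
index = VersionedVectorIndex()
index.upsert("p1", [0.0, 0.0])
index.upsert("p2", [1.0, 1.0])
index.compact("001")

index.delete("p1")
index.upsert("p3", [0.1, 0.1])
before = index.search([0.0, 0.0])  # still answered from version "001"

index.compact("002")
after = index.search([0.0, 0.0])   # answered from version "002"
print(before, after)
```

In this model, changes become visible to search only after compaction. Whether the actual design also consults uncompacted changes at query time is not stated in this issue.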
   
   ### User Experience
   
   **How users will use this feature: (WIP)**
   - Configuration changes needed
   - API changes
   - Usage examples
   
   
   ### Hudi RFC Requirements
   
   **RFC PR link:** (if applicable)
   https://github.com/apache/hudi/pull/14255


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
