suryaprasanna opened a new issue, #14290: URL: https://github.com/apache/hudi/issues/14290
### Feature Description

**What the feature achieves:**

This feature introduces native support for creating, storing, and maintaining **vector indexes** within Hudi’s metadata table. It enables Hudi to index high-dimensional embedding vectors (e.g., product embeddings, user embeddings, text/image embeddings) and efficiently serve **Approximate Nearest Neighbor (ANN) search** over these vectors directly from Hudi-managed datasets.

The feature provides:

- Automatic vector index creation during metadata initialization
- Horizontal scalability via clustering of embeddings into multiple indexes
- Versioned vector index management using compaction
- Consistent handling of inserts, updates, and deletes for embedding columns
- File-group-aware storage so ingestion and compaction can scale independently
- Native routing of ANN search queries to the latest index version

Overall, the feature gives Hudi first-class support for vector search infrastructure at data-lake scale, fully integrated with its timeline, metadata table, and write operations.

**Why this feature is needed:**

Modern machine learning and AI workloads rely heavily on embeddings generated by models such as BERT, CLIP, multimodal LLMs, and recommendation models. These embeddings must be stored and queried efficiently to support use cases such as:

- Semantic search
- Recommendation systems
- Similarity-based retrieval
- RAG (Retrieval-Augmented Generation) pipelines
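To make the idea of clustering embeddings into multiple indexes and routing ANN queries more concrete, here is a minimal, self-contained sketch in plain Python. It is not Hudi code and does not reflect the RFC's actual design: the function names (`build_partitions`, `search`), the fixed centroids, and the `n_probe` parameter are all hypothetical and chosen only for illustration. It assumes each embedding is assigned to its nearest centroid, and that a query scans only the partitions whose centroids are closest to it.

```python
import math
from collections import defaultdict


def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def build_partitions(vectors, centroids):
    """Assign each (key, vector) pair to its nearest centroid.

    Each centroid stands in for one independently stored index partition.
    """
    partitions = defaultdict(list)
    for key, vec in vectors.items():
        nearest = min(range(len(centroids)), key=lambda i: l2(vec, centroids[i]))
        partitions[nearest].append((key, vec))
    return partitions


def search(query, centroids, partitions, k=2, n_probe=1):
    """Route the query to the n_probe closest partitions, then scan only those."""
    probe = sorted(range(len(centroids)), key=lambda i: l2(query, centroids[i]))[:n_probe]
    candidates = [
        (l2(query, vec), key)
        for i in probe
        for key, vec in partitions.get(i, [])
    ]
    return [key for _, key in sorted(candidates)[:k]]


# Toy data: hypothetical record keys with 2-d embeddings and two fixed centroids.
centroids = [(0.0, 0.0), (10.0, 10.0)]
vectors = {"a": (0.1, 0.2), "b": (0.5, 0.4), "c": (9.8, 10.1), "d": (10.3, 9.9)}

partitions = build_partitions(vectors, centroids)
result = search((0.0, 0.1), centroids, partitions, k=2, n_probe=1)
print(result)
```

Raising `n_probe` scans more partitions, which trades query cost for recall. A real implementation would also need to learn the centroids and handle updates and deletes, which this sketch leaves out.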
Today, users must build and maintain external vector databases alongside Hudi datasets, which introduces:

- Higher cost due to vertical scaling
- Separate ingestion pipelines
- Risk of inconsistency between main data and vector data
- Extra infrastructure that must be scaled, monitored, and versioned
- Data duplication

By integrating vector index capabilities directly into Hudi:

- Vector data remains consistent with the main table
- ANN indexes are versioned and transactionally managed
- Updates/deletes to vectors naturally follow the Hudi timeline
- Index refresh occurs via compaction, leveraging existing orchestration
- Users avoid maintaining a separate vector database system
- Large-scale embeddings can be managed using Hudi’s file-group architecture

This brings vector search capabilities into the data lake itself, eliminating the need for external systems and enabling end-to-end machine-learning data pipelines within Hudi.

### User Experience

**How users will use this feature: (WIP)**

- Configuration changes needed
- API changes
- Usage examples

### Hudi RFC Requirements

**RFC PR link:** (if applicable) https://github.com/apache/hudi/pull/14255

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
