rahil-c commented on code in PR #14255: URL: https://github.com/apache/hudi/pull/14255#discussion_r2523901924
########## rfc/rfc-103/rfc-103.md: ##########
@@ -0,0 +1,178 @@
<!--
  Licensed to the Apache Software Foundation (ASF) under one or more
  contributor license agreements. See the NOTICE file distributed with
  this work for additional information regarding copyright ownership.
  The ASF licenses this file to You under the Apache License, Version 2.0
  (the "License"); you may not use this file except in compliance with
  the License. You may obtain a copy of the License at

      http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License.
-->
# RFC-103: Support Vector Index on Hudi

## Proposers

- @suryaprasanna
- @prashantwason

## Approvers
- @vinoth
- @rahil-c

## Status


## Abstract
With LLM applications on the rise, much of the focus has been on "operational" or "online" databases that have added vector search capabilities, and on specialized vector databases (Chroma, Pinecone, ...) that offer similar capabilities. Specialized vector databases claim support for better algorithms, optimized ingest/serving performance, and better integration with LLM application development frameworks like LangChain/LlamaIndex et al.

With its shared-storage/decoupled-compute model, the data lakehouse architecture has already proven its scalability and cost-effectiveness advantages over storing all data for analysis and processing in shared-nothing database architectures or data warehouses. We believe that extending data lakehouse storage and query engines with vector search capabilities can unlock the best of both worlds, with some exciting outcomes.

Storing vector indexes in a data lakehouse offers several advantages:
- **Infinitely scalable storage:**
  - Overcomes the pain of storing/scaling large amounts of embeddings in an online database forever, reducing costs while also allowing the production database to be more efficient and easier to operate.
- **Leverage scalable compute frameworks:**
  - There is already rich support for compute frameworks (Spark, Flink) to build ingest pipelines that maintain embeddings from upstream sources, as well as fast query engines (Presto, Trino, StarRocks) that can serve vector searches.
- **Tiered serving layer:**
  - Given that much of the embedding data is updated from data pipelines, typically in periodic batches (versus real-time individual updates), we can also provide reasonably fast serving of vector queries (e.g. 80% of the speed at 80% lower cost) by either extending the lakehouse storage with a caching tier or a tiered-storage integration into existing production/online vector databases. These could serve applications with different end-user expectations, e.g. internal business apps vs. user-facing applications.

By extending the multi-modal indexing subsystem, vector indexes can be stored as part of Hudi's metadata tables and can be served directly using


## Background
The following are the goals of this RFC:
- Creating vector indexes based on a base column of a table, either an embedding column or a text column.
- Indexes are automatically kept up to date when the base column changes, consistent with transactional boundaries.
- First-class SQL experience for creating and dropping indexes (Spark)
- SQL extensions to query the index. (Spark, then Presto/Trino)

Review Comment:
@suryaprasanna I am thinking that maybe we move these two bullet points into the scope of RFC-102, since that RFC handles exposing this capability to the user:
```
- First-class SQL experience for creating, and dropping indexes (Spark)
- SQL extensions to query the index. (Spark, then Presto/Trino)
```
I will link the draft RFC-102 around exposing vector search: https://github.com/apache/hudi/issues/14219

I also think that we already have SQL semantics for creating and dropping indexes in Spark for the secondary index (see https://github.com/apache/hudi/blob/master/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestSecondaryIndexDataTypes.scala#L190C1-L191C1), so I am wondering if we can leverage similar wiring?
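To make the "similar wiring" idea concrete, here is a hypothetical DDL sketch. The `secondary_index` form mirrors the existing Hudi Spark SQL syntax exercised by the linked test; the `vector_index` type name and its option keys are assumptions for illustration only and are not confirmed anywhere in Hudi today:

```sql
-- Existing Hudi secondary index DDL in Spark SQL (per the linked test):
CREATE INDEX idx_city ON hudi_table USING secondary_index(city);
DROP INDEX idx_city ON hudi_table;

-- A vector index could plug into the same CREATE INDEX wiring with a new
-- index type; "vector_index" and the option keys below are illustrative:
CREATE INDEX idx_embedding ON hudi_table
USING vector_index(embedding)
OPTIONS (distanceFunction = 'cosine', algorithm = 'hnsw');
```

If the vector index reuses the `CREATE INDEX ... USING <type>(<col>)` shape, index-type-specific tuning would naturally flow through the existing `OPTIONS` clause rather than requiring new grammar.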
