The Approach section doesn't cover how this will handle cross-replica search; that would be good to flesh out. Since the results have a real ranking, the current 2i logic may yield incorrect results, so I would think we need num_ranges / RF queries in the best case, plus some new capability to sort the results. If my assumption is correct, then how errors are handled should also be fleshed out. Example: a 1k-node cluster without vnodes and RF=3, so roughly 333 queries are fanned out to match, and then the coordinator needs to sort. If one of those queries fails and can't fall back to peers, does the whole query fail (I assume so)?
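
To make the question concrete, here is a minimal Python sketch of what I assume the coordinator would have to do: work out the best-case fan-out, collect a ranked partial result per token range, and merge them into a global top-k. Nothing here comes from the CEP or the Cassandra code base; `min_fanout`, `merge_top_k`, and the `(score, key)` hit shape are hypothetical names chosen for illustration, and failing the whole query on a missing range is my assumption, not a stated design.

```python
import heapq
import math
from typing import List, Optional, Tuple

# Hypothetical hit shape: (similarity score, row key); higher score = closer match.
Hit = Tuple[float, str]


def min_fanout(num_ranges: int, rf: int) -> int:
    """Best-case number of replica queries needed to cover every token range."""
    return math.ceil(num_ranges / rf)


def merge_top_k(partials: List[Optional[List[Hit]]], k: int) -> List[Hit]:
    """Merge per-range ranked results into a single global top-k.

    A partial of None stands for a sub-query that failed with no replica
    left to retry. This sketch fails the whole query in that case, because
    the missing range could hold some of the true nearest neighbours.
    """
    if any(p is None for p in partials):
        raise RuntimeError("a token range returned no result; cannot rank globally")
    all_hits = (hit for partial in partials for hit in partial)
    # nlargest sorts by score first, then by key to break ties.
    return heapq.nlargest(k, all_hits)
```

For 1000 ranges at RF=3 the ceiling works out to 334 sub-queries, in line with the rough 333 estimate. The point of the sketch is that each sub-query must return its own top-k, since any single range could contain all k of the global best matches.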

> On May 8, 2023, at 7:20 PM, Jonathan Ellis <jbel...@gmail.com> wrote:
>
> Hi all,
>
> Following the recent discussion threads, I would like to propose CEP-30 to
> add Approximate Nearest Neighbor (ANN) Vector Search via Storage-Attached
> Indexes (SAI) to Apache Cassandra.
>
> The primary goal of this proposal is to implement ANN vector search
> capabilities, making Cassandra more useful to AI developers and organizations
> managing large datasets that can benefit from fast similarity search.
>
> The implementation will leverage Lucene's Hierarchical Navigable Small World
> (HNSW) library and introduce a new CQL data type for vector embeddings, a new
> SAI index for ANN search functionality, and a new CQL operator for performing
> ANN search queries.
>
> We are targeting the 5.0 release for this feature, in conjunction with the
> release of SAI. The proposed changes will maintain compatibility with
> existing Cassandra functionality and compose well with the already-approved
> SAI features.
>
> Please find the full CEP document here:
> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-30%3A+Approximate+Nearest+Neighbor%28ANN%29+Vector+Search+via+Storage-Attached+Indexes
>
> --
> Jonathan Ellis
> co-founder, http://www.datastax.com
> @spyced