The Approach section doesn't cover how this will handle cross-replica search, 
which would be good to flesh out. Given that results have a real ranking, the 
current 2i logic may yield incorrect results, so I would think we need 
num_ranges / RF queries in the best case, with some new capability to sort the 
results? If my assumption is correct, then how errors are handled should also 
be fleshed out. Example: a 1k-node cluster without vnodes and RF=3, so roughly 
333 queries are fanned out to match, and then the coordinator needs to sort. If 
one of those queries fails and can't fall back to peers, does the whole query 
fail (I assume so)?
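
To make the coordinator-side concern concrete, here is a rough Python sketch of 
the merge step. It is not Cassandra code: merge_top_k and RangeQueryFailed are 
hypothetical names, and it assumes each sub-query returns (score, row_key) 
pairs where a higher score means more similar, with None standing in for a 
sub-query that failed and had no peer to fall back to.

```python
import heapq


class RangeQueryFailed(Exception):
    """Hypothetical error: one range group produced no usable result."""


def merge_top_k(per_range_results, k):
    """Merge per-range ANN hits into a single global top-k list.

    per_range_results holds one entry per fanned-out sub-query. Each entry is
    a list of (score, row_key) tuples, or None if that sub-query failed.
    """
    for i, results in enumerate(per_range_results):
        if results is None:
            # Assumption: with a real ranking, a missing range could hide the
            # true nearest neighbors, so the whole query is failed.
            raise RangeQueryFailed(f"no result for range group {i}")

    all_hits = (hit for results in per_range_results for hit in results)
    # nlargest returns the k best hits, sorted from highest to lowest score.
    return heapq.nlargest(k, all_hits, key=lambda hit: hit[0])


if __name__ == "__main__":
    results = [
        [(0.91, "a"), (0.40, "b")],
        [(0.97, "c"), (0.10, "d")],
        [(0.88, "e")],
    ]
    print(merge_top_k(results, 2))
```

The point of the sketch is that every range group has to contribute its local 
top-k before the global order is known, which is why a single unrecoverable 
sub-query failure looks like it has to fail the whole request.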

> On May 8, 2023, at 7:20 PM, Jonathan Ellis <jbel...@gmail.com> wrote:
> 
> Hi all,
> 
> Following the recent discussion threads, I would like to propose CEP-30 to 
> add Approximate Nearest Neighbor (ANN) Vector Search via Storage-Attached 
> Indexes (SAI) to Apache Cassandra.
> 
> The primary goal of this proposal is to implement ANN vector search 
> capabilities, making Cassandra more useful to AI developers and organizations 
> managing large datasets that can benefit from fast similarity search.
> 
> The implementation will leverage Lucene's Hierarchical Navigable Small World 
> (HNSW) library and introduce a new CQL data type for vector embeddings, a new 
> SAI index for ANN search functionality, and a new CQL operator for performing 
> ANN search queries.
> 
> We are targeting the 5.0 release for this feature, in conjunction with the 
> release of SAI. The proposed changes will maintain compatibility with 
> existing Cassandra functionality and compose well with the already-approved 
> SAI features.
> 
> Please find the full CEP document here: 
> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-30%3A+Approximate+Nearest+Neighbor%28ANN%29+Vector+Search+via+Storage-Attached+Indexes
> 
> -- 
> Jonathan Ellis
> co-founder, http://www.datastax.com <http://www.datastax.com/>
> @spyced
