Re: CEP-30: Approximate Nearest Neighbor(ANN) Vector Search via Storage-Attached Indexes

2023-05-17 Thread David Capwell
Thanks for the update, LGTM

> On May 17, 2023, at 5:35 AM, Jasonstack Zhao Yang  
> wrote:
> 
> Hi,
> 
> I have updated the CEP with some details about distributed queries in the 
> Approach section.
> 
> David:
> 
> > given results have a real ranking, the current 2i logic may yield incorrect 
> > results
> 
> C* internal iterators are all in primary key order. So we need two in-memory 
> top-k filters, one at replica side and one at coordinator side, to make sure 
> the returned rows are actually top-k but still primary key order.
> 
> > if 1 of the queries fails and can’t fall back to peers… does the query fail 
> > (I assume so)
> 
> yes, it will fail. we can make it pass if lower recall is acceptable.
> 
> Caleb:
> 
> > With smaller clusters or use-cases that are extremely 
> > write-heavy/read-light, it's possible that the full scatter/gather won't be 
> > too onerous, especially w/ a few small tweaks (on top of a non-vnode 
> > cluster)
> 
> You are right. Smaller cluster would definitely requires less coordinator 
> memory to cache all required replicas' responses.
> 
> 
> Jeremy:
> 
> >  With SAI, can you have partial results?  When you have a query that is 
> > non-key based, you need to have full token range coverage of the results.  
> > If that isn't possible, will Vector Search/SAI return partial results?
> 
> No partial result allowed. Query will failed with unavailability exception if 
> some required token range is not available. For ANN search, users might be 
> willing to have lower recall (partial results) with higher availability.
> 
> >  First, how is ordering/scoring done?
> > Each replica returns back to the coordinator a sorted set of results and 
> > the coordinator will have to see all of the results globally in order to do 
> > a global ordering.  You can't know what the top result is unless you've 
> > seen everything.  As to the scoring, I'm not sure how that will get 
> > calculated.
> 
> The results will be top-k but still in primary key order. Scores are computed 
> based on vector similarly function.
> 
> Top-K search need two top-k filter as described in CEP. 
> 
> > Second, if I am ordering the results like for a Vector Search and I want to 
> > have the top 1 result.  How is the scoring done and what happens if there 
> > are 20 that have the same score?  How will the coordinator decide which 1 
> > is returned out of 20?
> 
> It will be the row with smaller primary key order.
> 
> On Wed, 10 May 2023 at 05:39, Jeremy Hanna  > wrote:
>> Just wanted to add that I don't have any special knowledge of CEP-30 beyond 
>> what Jonathan posted and just trying to help clarify and answer questions as 
>> I can with some knowledge and experience from DSE Search and SAI.  Thanks to 
>> Caleb for helping validate some things as well.  And to be clear about 
>> partial results - the default with DSE Search at least is to fail a query if 
>> it can't get the full token range coverage.  However there is an option to 
>> allow for shards being unavailable and return partial results.
>> 
>>> On May 9, 2023, at 3:38 PM, Jeremy Hanna >> > wrote:
>>> 
>>> I talked to David and some others in slack to hopefully clarify:
>>> 
>>> With SAI, can you have partial results?  When you have a query that is 
>>> non-key based, you need to have full token range coverage of the results.  
>>> If that isn't possible, will Vector Search/SAI return partial results?
>>> 
>>> Anything can happen in the implementation, but for scoring, it may not make 
>>> sense to return partial results because it's misleading.  For non-global 
>>> queries, it could or couldn't return partial results depending on 
>>> implementation/configuration.  In DSE you could have partial results 
>>> depending on the options.   However I couldn't find partial results defined 
>>> in CEP-7 or CEP-30.
>>> 
>>> The other questions are about scoring.
>>> 
>>> First, how is ordering/scoring done?
>>> 
>>> Each replica returns back to the coordinator a sorted set of results and 
>>> the coordinator will have to see all of the results globally in order to do 
>>> a global ordering.  You can't know what the top result is unless you've 
>>> seen everything.  As to the scoring, I'm not sure how that will get 
>>> calculated.
>>> 
>>> Second, if I am ordering the results like for a Vector Search and I want to 
>>> have the top 1 result.  How is the scoring done and what happens if there 
>>> are 20 that have the same score?  How will the coordinator decide which 1 
>>> is returned out of 20?
>>> 
>>> It returns results in token/partition and then clustering order.
>>> 
 On May 9, 2023, at 2:53 PM, Caleb Rackliffe >>> > wrote:
 
 Anyone on this ML who still remembers DSE Search (or has experience w/ 
 Elastic or SolrCloud) probably also knows that there are some significant 
 pieces of an optimized scatter/gather apparatus 

Re: CEP-30: Approximate Nearest Neighbor(ANN) Vector Search via Storage-Attached Indexes

2023-05-17 Thread Jasonstack Zhao Yang
Hi,

I have updated the CEP with some details about distributed queries in the
*Approach* section.

David:

> given results have a real ranking, the current 2i logic may yield
incorrect results

C* internal iterators are all in primary key order. So we need two
in-memory top-k filters, one at replica side and one at coordinator side,
to make sure the returned rows are actually top-k but still primary key
order.

> if 1 of the queries fails and can’t fall back to peers… does the query
fail (I assume so)

yes, it will fail. we can make it pass if lower recall is acceptable.

Caleb:

> With smaller clusters or use-cases that are extremely
write-heavy/read-light, it's possible that the full scatter/gather won't be
too onerous, especially w/ a few small tweaks (on top of a non-vnode
cluster)

You are right. Smaller cluster would definitely requires less coordinator
memory to cache all required replicas' responses.


Jeremy:

>  With SAI, can you have partial results?  When you have a query that is
non-key based, you need to have full token range coverage of the results.
If that isn't possible, will Vector Search/SAI return partial results?

No partial result allowed. Query will failed with unavailability exception
if some required token range is not available. For ANN search, users might
be willing to have lower recall (partial results) with higher availability.

>  First, how is ordering/scoring done?
> Each replica returns back to the coordinator a sorted set of results and
the coordinator will have to see all of the results globally in order to do
a global ordering.  You can't know what the top result is unless you've
seen everything.  As to the scoring, I'm not sure how that will get
calculated.

The results will be top-k but still in primary key order. Scores are
computed based on vector similarly function.

Top-K search need two top-k filter as described in CEP.

> Second, if I am ordering the results like for a Vector Search and I want
to have the top 1 result.  How is the scoring done and what happens if
there are 20 that have the same score?  How will the coordinator decide
which 1 is returned out of 20?

It will be the row with smaller primary key order.

On Wed, 10 May 2023 at 05:39, Jeremy Hanna 
wrote:

> Just wanted to add that I don't have any special knowledge of CEP-30
> beyond what Jonathan posted and just trying to help clarify and answer
> questions as I can with some knowledge and experience from DSE Search and
> SAI.  Thanks to Caleb for helping validate some things as well.  And to be
> clear about partial results - the default with DSE Search at least is to
> fail a query if it can't get the full token range coverage.  However there
> is an option to allow for shards being unavailable and return partial
> results.
>
> On May 9, 2023, at 3:38 PM, Jeremy Hanna 
> wrote:
>
> I talked to David and some others in slack to hopefully clarify:
>
> With SAI, can you have partial results?  When you have a query that is
> non-key based, you need to have full token range coverage of the results.
> If that isn't possible, will Vector Search/SAI return partial results?
>
> Anything can happen in the implementation, but for scoring, it may not
> make sense to return partial results because it's misleading.  For
> non-global queries, it could or couldn't return partial results depending
> on implementation/configuration.  In DSE you could have partial results
> depending on the options.   However I couldn't find partial results defined
> in CEP-7 or CEP-30.
>
> The other questions are about scoring.
>
> First, how is ordering/scoring done?
>
> Each replica returns back to the coordinator a sorted set of results and
> the coordinator will have to see all of the results globally in order to do
> a global ordering.  You can't know what the top result is unless you've
> seen everything.  As to the scoring, I'm not sure how that will get
> calculated.
>
> Second, if I am ordering the results like for a Vector Search and I want
> to have the top 1 result.  How is the scoring done and what happens if
> there are 20 that have the same score?  How will the coordinator decide
> which 1 is returned out of 20?
>
> It returns results in token/partition and then clustering order.
>
> On May 9, 2023, at 2:53 PM, Caleb Rackliffe 
> wrote:
>
> Anyone on this ML who still remembers DSE Search (or has experience w/
> Elastic or SolrCloud) probably also knows that there are some significant
> pieces of an optimized scatter/gather apparatus for IR (even without
> sorting, which also doesn't exist yet) that do not exist in C* or it's
> range query system (which SAI and all other 2i implementations use). SAI,
> like all C* 2i implementations, is still a local index, and as that is the
> case, anything built on it will perform best in partition-scoped (at least
> on the read side) use-cases. (On the bright side, the project is moving
> toward larger partitions being a possibility.) With smaller clusters or
> use-cases that 

Re: CEP-30: Approximate Nearest Neighbor(ANN) Vector Search via Storage-Attached Indexes

2023-05-09 Thread Jeremy Hanna
Just wanted to add that I don't have any special knowledge of CEP-30 beyond 
what Jonathan posted and just trying to help clarify and answer questions as I 
can with some knowledge and experience from DSE Search and SAI.  Thanks to 
Caleb for helping validate some things as well.  And to be clear about partial 
results - the default with DSE Search at least is to fail a query if it can't 
get the full token range coverage.  However there is an option to allow for 
shards being unavailable and return partial results.

> On May 9, 2023, at 3:38 PM, Jeremy Hanna  wrote:
> 
> I talked to David and some others in slack to hopefully clarify:
> 
> With SAI, can you have partial results?  When you have a query that is 
> non-key based, you need to have full token range coverage of the results.  If 
> that isn't possible, will Vector Search/SAI return partial results?
> 
> Anything can happen in the implementation, but for scoring, it may not make 
> sense to return partial results because it's misleading.  For non-global 
> queries, it could or couldn't return partial results depending on 
> implementation/configuration.  In DSE you could have partial results 
> depending on the options.   However I couldn't find partial results defined 
> in CEP-7 or CEP-30.
> 
> The other questions are about scoring.
> 
> First, how is ordering/scoring done?
> 
> Each replica returns back to the coordinator a sorted set of results and the 
> coordinator will have to see all of the results globally in order to do a 
> global ordering.  You can't know what the top result is unless you've seen 
> everything.  As to the scoring, I'm not sure how that will get calculated.
> 
> Second, if I am ordering the results like for a Vector Search and I want to 
> have the top 1 result.  How is the scoring done and what happens if there are 
> 20 that have the same score?  How will the coordinator decide which 1 is 
> returned out of 20?
> 
> It returns results in token/partition and then clustering order.
> 
>> On May 9, 2023, at 2:53 PM, Caleb Rackliffe  wrote:
>> 
>> Anyone on this ML who still remembers DSE Search (or has experience w/ 
>> Elastic or SolrCloud) probably also knows that there are some significant 
>> pieces of an optimized scatter/gather apparatus for IR (even without 
>> sorting, which also doesn't exist yet) that do not exist in C* or it's range 
>> query system (which SAI and all other 2i implementations use). SAI, like all 
>> C* 2i implementations, is still a local index, and as that is the case, 
>> anything built on it will perform best in partition-scoped (at least on the 
>> read side) use-cases. (On the bright side, the project is moving toward 
>> larger partitions being a possibility.) With smaller clusters or use-cases 
>> that are extremely write-heavy/read-light, it's possible that the full 
>> scatter/gather won't be too onerous, especially w/ a few small tweaks (on 
>> top of a non-vnode cluster) to a.) keep fanout minimal and b.) keep 
>> range/index queries to a single pass to minimize latency.
>> 
>> Whatever we do, we just need to avoid a situation down the road where users 
>> don't understand these nuances and hit a wall where they try to use this in 
>> a way that is fundamentally incompatible w/ the way the database 
>> scales/works. (I've done my best to call this out in all discussions around 
>> SAI over time, and there may even end up being further guardrails put in 
>> place to make it even harder to misuse it...but I digress.)
>> 
>> Having said all that, I don't fundamentally have a problem w/ the proposal.
>> 
>> On Tue, May 9, 2023 at 2:11 PM Benedict > > wrote:
>>> HNSW can in principle be made into a distributed index. But that would be 
>>> quite a different paradigm to SAI.
>>> 
 On 9 May 2023, at 19:30, Patrick McFadin >>> > wrote:
 
 
 Under the goals section, there is this line:
 
 Scatter/gather across replicas, combining topK from each to get global 
 topK.
 
 But what I'm hearing is, exactly how will that happen? Maybe this is an 
 SAI question too. How is that verified in SAI?
 
 On Tue, May 9, 2023 at 11:07 AM David Capwell >>> > wrote:
> Approach section doesn’t go over how this will handle cross replica 
> search, this would be good to flesh out… given results have a real 
> ranking, the current 2i logic may yield incorrect results… so would think 
> we need num_ranges / rf queries in the best case, with some new 
> capability to sort the results?  If my assumption is correct, then how 
> errors are handled should also be fleshed out… Example: 1k cluster 
> without vnode and RF=3, so 333 queries fanned out to match, then 
> coordinator needs to sort… if 1 of the queries fails and can’t fall back 
> to peers… does the query fail (I assume so)?
> 
>> On May 8, 2023, at 7:20 PM, Jonathan Ellis > 

Re: CEP-30: Approximate Nearest Neighbor(ANN) Vector Search via Storage-Attached Indexes

2023-05-09 Thread Jeremy Hanna
I talked to David and some others in slack to hopefully clarify:

With SAI, can you have partial results?  When you have a query that is non-key 
based, you need to have full token range coverage of the results.  If that 
isn't possible, will Vector Search/SAI return partial results?

Anything can happen in the implementation, but for scoring, it may not make 
sense to return partial results because it's misleading.  For non-global 
queries, it could or couldn't return partial results depending on 
implementation/configuration.  In DSE you could have partial results depending 
on the options.   However I couldn't find partial results defined in CEP-7 or 
CEP-30.

The other questions are about scoring.

First, how is ordering/scoring done?

Each replica returns back to the coordinator a sorted set of results and the 
coordinator will have to see all of the results globally in order to do a 
global ordering.  You can't know what the top result is unless you've seen 
everything.  As to the scoring, I'm not sure how that will get calculated.

Second, if I am ordering the results like for a Vector Search and I want to 
have the top 1 result.  How is the scoring done and what happens if there are 
20 that have the same score?  How will the coordinator decide which 1 is 
returned out of 20?

It returns results in token/partition and then clustering order.

> On May 9, 2023, at 2:53 PM, Caleb Rackliffe  wrote:
> 
> Anyone on this ML who still remembers DSE Search (or has experience w/ 
> Elastic or SolrCloud) probably also knows that there are some significant 
> pieces of an optimized scatter/gather apparatus for IR (even without sorting, 
> which also doesn't exist yet) that do not exist in C* or it's range query 
> system (which SAI and all other 2i implementations use). SAI, like all C* 2i 
> implementations, is still a local index, and as that is the case, anything 
> built on it will perform best in partition-scoped (at least on the read side) 
> use-cases. (On the bright side, the project is moving toward larger 
> partitions being a possibility.) With smaller clusters or use-cases that are 
> extremely write-heavy/read-light, it's possible that the full scatter/gather 
> won't be too onerous, especially w/ a few small tweaks (on top of a non-vnode 
> cluster) to a.) keep fanout minimal and b.) keep range/index queries to a 
> single pass to minimize latency.
> 
> Whatever we do, we just need to avoid a situation down the road where users 
> don't understand these nuances and hit a wall where they try to use this in a 
> way that is fundamentally incompatible w/ the way the database scales/works. 
> (I've done my best to call this out in all discussions around SAI over time, 
> and there may even end up being further guardrails put in place to make it 
> even harder to misuse it...but I digress.)
> 
> Having said all that, I don't fundamentally have a problem w/ the proposal.
> 
> On Tue, May 9, 2023 at 2:11 PM Benedict  > wrote:
>> HNSW can in principle be made into a distributed index. But that would be 
>> quite a different paradigm to SAI.
>> 
>>> On 9 May 2023, at 19:30, Patrick McFadin >> > wrote:
>>> 
>>> 
>>> Under the goals section, there is this line:
>>> 
>>> Scatter/gather across replicas, combining topK from each to get global topK.
>>> 
>>> But what I'm hearing is, exactly how will that happen? Maybe this is an SAI 
>>> question too. How is that verified in SAI?
>>> 
>>> On Tue, May 9, 2023 at 11:07 AM David Capwell >> > wrote:
 Approach section doesn’t go over how this will handle cross replica 
 search, this would be good to flesh out… given results have a real 
 ranking, the current 2i logic may yield incorrect results… so would think 
 we need num_ranges / rf queries in the best case, with some new capability 
 to sort the results?  If my assumption is correct, then how errors are 
 handled should also be fleshed out… Example: 1k cluster without vnode and 
 RF=3, so 333 queries fanned out to match, then coordinator needs to sort… 
 if 1 of the queries fails and can’t fall back to peers… does the query 
 fail (I assume so)?
 
> On May 8, 2023, at 7:20 PM, Jonathan Ellis  > wrote:
> 
> Hi all,
> 
> Following the recent discussion threads, I would like to propose CEP-30 
> to add Approximate Nearest Neighbor (ANN) Vector Search via 
> Storage-Attached Indexes (SAI) to Apache Cassandra.
> 
> The primary goal of this proposal is to implement ANN vector search 
> capabilities, making Cassandra more useful to AI developers and 
> organizations managing large datasets that can benefit from fast 
> similarity search.
> 
> The implementation will leverage Lucene's Hierarchical Navigable Small 
> World (HNSW) library and introduce a new CQL data type for vector 
> embeddings, a new SAI 

Re: CEP-30: Approximate Nearest Neighbor(ANN) Vector Search via Storage-Attached Indexes

2023-05-09 Thread Caleb Rackliffe
Anyone on this ML who still remembers DSE Search (or has experience w/
Elastic or SolrCloud) probably also knows that there are some significant
pieces of an optimized scatter/gather apparatus for IR (even without
sorting, which also doesn't exist yet) that do not exist in C* or it's
range query system (which SAI and all other 2i implementations use). SAI,
like all C* 2i implementations, is still a local index, and as that is the
case, anything built on it will perform best in partition-scoped (at least
on the read side) use-cases. (On the bright side, the project is moving
toward larger partitions being a possibility.) With smaller clusters or
use-cases that are extremely write-heavy/read-light, it's possible that the
full scatter/gather won't be too onerous, especially w/ a few small tweaks
(on top of a non-vnode cluster) to a.) keep fanout minimal and b.) keep
range/index queries to a single pass to minimize latency.

Whatever we do, we just need to avoid a situation down the road where users
don't understand these nuances and hit a wall where they try to use this in
a way that is fundamentally incompatible w/ the way the database
scales/works. (I've done my best to call this out in all discussions around
SAI over time, and there may even end up being further guardrails put in
place to make it even harder to misuse it...but I digress.)

Having said all that, I don't fundamentally have a problem w/ the proposal.

On Tue, May 9, 2023 at 2:11 PM Benedict  wrote:

> HNSW can in principle be made into a distributed index. But that would be
> quite a different paradigm to SAI.
>
> On 9 May 2023, at 19:30, Patrick McFadin  wrote:
>
> 
> Under the goals section, there is this line:
>
>
>1. Scatter/gather across replicas, combining topK from each to get
>global topK.
>
>
> But what I'm hearing is, exactly how will that happen? Maybe this is an
> SAI question too. How is that verified in SAI?
>
> On Tue, May 9, 2023 at 11:07 AM David Capwell  wrote:
>
>> Approach section doesn’t go over how this will handle cross replica
>> search, this would be good to flesh out… given results have a real ranking,
>> the current 2i logic may yield incorrect results… so would think we need
>> num_ranges / rf queries in the best case, with some new capability to sort
>> the results?  If my assumption is correct, then how errors are handled
>> should also be fleshed out… Example: 1k cluster without vnode and RF=3, so
>> 333 queries fanned out to match, then coordinator needs to sort… if 1 of
>> the queries fails and can’t fall back to peers… does the query fail (I
>> assume so)?
>>
>> On May 8, 2023, at 7:20 PM, Jonathan Ellis  wrote:
>>
>> Hi all,
>>
>> Following the recent discussion threads, I would like to propose CEP-30
>> to add Approximate Nearest Neighbor (ANN) Vector Search via
>> Storage-Attached Indexes (SAI) to Apache Cassandra.
>>
>> The primary goal of this proposal is to implement ANN vector search
>> capabilities, making Cassandra more useful to AI developers and
>> organizations managing large datasets that can benefit from fast similarity
>> search.
>>
>> The implementation will leverage Lucene's Hierarchical Navigable Small
>> World (HNSW) library and introduce a new CQL data type for vector
>> embeddings, a new SAI index for ANN search functionality, and a new CQL
>> operator for performing ANN search queries.
>>
>> We are targeting the 5.0 release for this feature, in conjunction with
>> the release of SAI. The proposed changes will maintain compatibility with
>> existing Cassandra functionality and compose well with the already-approved
>> SAI features.
>>
>> Please find the full CEP document here:
>> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-30%3A+Approximate+Nearest+Neighbor%28ANN%29+Vector+Search+via+Storage-Attached+Indexes
>>
>> --
>> Jonathan Ellis
>> co-founder, http://www.datastax.com
>> @spyced
>>
>>
>>


Re: CEP-30: Approximate Nearest Neighbor(ANN) Vector Search via Storage-Attached Indexes

2023-05-09 Thread Benedict
HNSW can in principle be made into a distributed index. But that would be quite a different paradigm to SAI.On 9 May 2023, at 19:30, Patrick McFadin  wrote:Under the goals section, there is this line:Scatter/gather across replicas, combining topK from each to get global topK.But what I'm hearing is, exactly how will that happen? Maybe this is an SAI question too. How is that verified in SAI?On Tue, May 9, 2023 at 11:07 AM David Capwell  wrote:Approach section doesn’t go over how this will handle cross replica search, this would be good to flesh out… given results have a real ranking, the current 2i logic may yield incorrect results… so would think we need num_ranges / rf queries in the best case, with some new capability to sort the results?  If my assumption is correct, then how errors are handled should also be fleshed out… Example: 1k cluster without vnode and RF=3, so 333 queries fanned out to match, then coordinator needs to sort… if 1 of the queries fails and can’t fall back to peers… does the query fail (I assume so)?On May 8, 2023, at 7:20 PM, Jonathan Ellis  wrote:Hi all,Following the recent discussion threads, I would like to propose CEP-30 to add Approximate Nearest Neighbor (ANN) Vector Search via Storage-Attached Indexes (SAI) to Apache Cassandra.The primary goal of this proposal is to implement ANN vector search capabilities, making Cassandra more useful to AI developers and organizations managing large datasets that can benefit from fast similarity search.The implementation will leverage Lucene's Hierarchical Navigable Small World (HNSW) library and introduce a new CQL data type for vector embeddings, a new SAI index for ANN search functionality, and a new CQL operator for performing ANN search queries.We are targeting the 5.0 release for this feature, in conjunction with the release of SAI. The proposed changes will maintain compatibility with existing Cassandra functionality and compose well with the already-approved SAI features.Please find the full CEP document here: https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-30%3A+Approximate+Nearest+Neighbor%28ANN%29+Vector+Search+via+Storage-Attached+Indexes-- Jonathan Ellisco-founder, http://www.datastax.com@spyced



Re: CEP-30: Approximate Nearest Neighbor(ANN) Vector Search via Storage-Attached Indexes

2023-05-09 Thread Patrick McFadin
Under the goals section, there is this line:


   1. Scatter/gather across replicas, combining topK from each to get
   global topK.


But what I'm hearing is, exactly how will that happen? Maybe this is an SAI
question too. How is that verified in SAI?

On Tue, May 9, 2023 at 11:07 AM David Capwell  wrote:

> Approach section doesn’t go over how this will handle cross replica
> search, this would be good to flesh out… given results have a real ranking,
> the current 2i logic may yield incorrect results… so would think we need
> num_ranges / rf queries in the best case, with some new capability to sort
> the results?  If my assumption is correct, then how errors are handled
> should also be fleshed out… Example: 1k cluster without vnode and RF=3, so
> 333 queries fanned out to match, then coordinator needs to sort… if 1 of
> the queries fails and can’t fall back to peers… does the query fail (I
> assume so)?
>
> On May 8, 2023, at 7:20 PM, Jonathan Ellis  wrote:
>
> Hi all,
>
> Following the recent discussion threads, I would like to propose CEP-30 to
> add Approximate Nearest Neighbor (ANN) Vector Search via Storage-Attached
> Indexes (SAI) to Apache Cassandra.
>
> The primary goal of this proposal is to implement ANN vector search
> capabilities, making Cassandra more useful to AI developers and
> organizations managing large datasets that can benefit from fast similarity
> search.
>
> The implementation will leverage Lucene's Hierarchical Navigable Small
> World (HNSW) library and introduce a new CQL data type for vector
> embeddings, a new SAI index for ANN search functionality, and a new CQL
> operator for performing ANN search queries.
>
> We are targeting the 5.0 release for this feature, in conjunction with the
> release of SAI. The proposed changes will maintain compatibility with
> existing Cassandra functionality and compose well with the already-approved
> SAI features.
>
> Please find the full CEP document here:
> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-30%3A+Approximate+Nearest+Neighbor%28ANN%29+Vector+Search+via+Storage-Attached+Indexes
>
> --
> Jonathan Ellis
> co-founder, http://www.datastax.com
> @spyced
>
>
>


Re: CEP-30: Approximate Nearest Neighbor(ANN) Vector Search via Storage-Attached Indexes

2023-05-09 Thread David Capwell
Approach section doesn’t go over how this will handle cross replica search, 
this would be good to flesh out… given results have a real ranking, the current 
2i logic may yield incorrect results… so would think we need num_ranges / rf 
queries in the best case, with some new capability to sort the results?  If my 
assumption is correct, then how errors are handled should also be fleshed out… 
Example: 1k cluster without vnode and RF=3, so 333 queries fanned out to match, 
then coordinator needs to sort… if 1 of the queries fails and can’t fall back 
to peers… does the query fail (I assume so)?

> On May 8, 2023, at 7:20 PM, Jonathan Ellis  wrote:
> 
> Hi all,
> 
> Following the recent discussion threads, I would like to propose CEP-30 to 
> add Approximate Nearest Neighbor (ANN) Vector Search via Storage-Attached 
> Indexes (SAI) to Apache Cassandra.
> 
> The primary goal of this proposal is to implement ANN vector search 
> capabilities, making Cassandra more useful to AI developers and organizations 
> managing large datasets that can benefit from fast similarity search.
> 
> The implementation will leverage Lucene's Hierarchical Navigable Small World 
> (HNSW) library and introduce a new CQL data type for vector embeddings, a new 
> SAI index for ANN search functionality, and a new CQL operator for performing 
> ANN search queries.
> 
> We are targeting the 5.0 release for this feature, in conjunction with the 
> release of SAI. The proposed changes will maintain compatibility with 
> existing Cassandra functionality and compose well with the already-approved 
> SAI features.
> 
> Please find the full CEP document here: 
> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-30%3A+Approximate+Nearest+Neighbor%28ANN%29+Vector+Search+via+Storage-Attached+Indexes
> 
> -- 
> Jonathan Ellis
> co-founder, http://www.datastax.com 
> @spyced