michaeljmarshall opened a new pull request, #4615: URL: https://github.com/apache/cassandra/pull/4615
(cherry picked from commit f2e34117d6a313be14156d92a3485f29424144c1) This PR implements the feature proposed in https://issues.apache.org/jira/browse/CASSANDRA-20086. ## Motivation Current ANN search does the following: 1. Get top k vectors from each index segment (memory or sstable) 2. Sort them in Primary Key order 3. Consume and materialize all of of them (`LIMIT * numSegments`) 4. Re-query with higher LIMIT if shadow keys are encountered (updates/overwrites) 5. Re-score to sort by similarity score 6. Take LIMIT best vectors 7. Sort by Primary Key This flow is both inefficient due to multiple sorts/requeries and can produce invalid results in cases where a row has an update to a vector to a worse score. The new flow: 1. Get top k vectors from each index segment (memory or sstable) 2. Build score ordered iterators 3. Merge those iterators, sorting by score descending 4. Consume, materialize, and validate rows from the global iterator until we have LIMIT rows 5. If an iterator is exhausted during step 4, requery the exhausted iterator's graph to get more vectors, using a bitset to ignore already visited nodes, until the graph is exhausted. 6. Sort LIMIT rows by primary key. This change removes two sort operations. It also fixes all update cases by asserting that a row's position within the score ordered iterator is only valid if the vector cell's materialized value is from the same sstable. ## Other minor changes 1. Increased test coverage by parameterizing the logic to test brute force, graph search, and "graph's choice". 2. Implemented custom brute force iterator that utilizes two priority queues to do the PQ based ranking then the full resolution ranking 3. Fixed a bug in the mapping of `limit` to `topk` where we could get `topk < limit` for limit 1000 There are still a number of TODOs. Some of these are left as possible jira tickets to create while others are questions for reviewers. I plan on completing some basic bench marks next week to demonstrate the value of this change. For now, I am opening this to start to get some reviews. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]

