[ https://issues.apache.org/jira/browse/CASSANDRA-20086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18021901#comment-18021901 ]
Michael Marshall commented on CASSANDRA-20086: ---------------------------------------------- I pushed a fix for some of the tests and for the lint errors and triggered another CI build http://ci-cassandra.infra.datastax.com/job/cassandra/41. Some of the tests are still expected to fail. I'll figure out a path forward on those soon. > Use similarity score ordered iterators in SAI vector search; fix handling of > updated vectors > -------------------------------------------------------------------------------------------- > > Key: CASSANDRA-20086 > URL: https://issues.apache.org/jira/browse/CASSANDRA-20086 > Project: Apache Cassandra > Issue Type: Bug > Components: Feature/Vector Search > Reporter: Michael Marshall > Assignee: Michael Marshall > Priority: Urgent > Fix For: 5.0.x, 5.x > > Time Spent: 10m > Remaining Estimate: 0h > > The current vector search implementation has the following execution path > within SAI: > # Get topk vectors in score order (desc) from each vector index. > # Sort in Primary Key order > # Pull all results into the VectorTopKProcessor, sort/truncate to queried > LIMIT > # StorageAttachedIndexSearcher checks if there were any shadowed rows, and > if there are, re-runs steps 1 through 3 until there are no additional > shadowed rows. > There are several problems with this approach. > # It does not properly handle updated vectors, and can return low scoring > vectors. > # It materializes `limit * num segment indexes` rows on the happy path. > # It is very sensitive to a single overwrite/tombstone and leads to timeouts > in many simple use cases. This sensitivity is even worse for hybrid queries > because the re-query logic forces us to search the boolean predicates again > too. > I propose that we modify the execution path within SAI and decouple the SAI > vector search from other search types. We can return iterators out of the > vector index that are in score order, not Primary Key order. In doing so, we > can create merge-able iterators that will give us a single score ordered > iterator from which we can materialize one row at a time in the > VectorTopKProcessor. > The solution removes the sensitivity to tombstones and properly handles > updated vectors by making sure that the similarity score of the materialized > vector matches the similarity score computed within the index. > I implemented this in the DataStax fork here > [https://github.com/datastax/cassandra/pull/929|https://github.com/datastax/cassandra/pull/929]. > The PR has quite a bit of detail and has been run in production for a while > now, so it is hardened code. At that time, we saw hybrid query p99 latencies > for some production workloads go from 2 seconds to 40 milliseconds, and for > other workloads, we saw queries go from timing out to 200 milliseconds. > I have continued to develop on the design, so the work that I am proposing > bringing in to SAI will be a bit of an updated version of 929. > I put together some mermaid diagrams to show how the patch affects query > execution within SAI/vector: > https://github.com/michaeljmarshall/cassandra/blob/vsearch/src/java/org/apache/cassandra/index/sai/VECTOR.md > Here is a sample of a test that does not work for the current vector > implementation but will pass with my patch. > {code:java} > @Test > public void testUpdateVectorToWorseAndBetterPositions() throws Throwable > { > createTable("CREATE TABLE %s (pk int, val vector<float, 2>, PRIMARY > KEY(pk))"); > createIndex("CREATE CUSTOM INDEX ON %s(val) USING > 'StorageAttachedIndex'"); > execute("INSERT INTO %s (pk, val) VALUES (0, [1.0, 2.0])"); > execute("INSERT INTO %s (pk, val) VALUES (1, [1.0, 3.0])"); > flush(); > execute("INSERT INTO %s (pk, val) VALUES (0, [1.0, 4.0])"); > beforeAndAfterFlush(() -> { > assertRows(execute("SELECT pk FROM %s ORDER BY val ann of [1.0, 2.0] > LIMIT 1"), row(1)); > assertRows(execute("SELECT pk FROM %s ORDER BY val ann of [1.0, 2.0] > LIMIT 2"), row(1), row(0)); > }); > // And now update pk 1 to show that we can get 0 too > execute("INSERT INTO %s (pk, val) VALUES (1, [1.0, 5.0])"); > beforeAndAfterFlush(() -> { > assertRows(execute("SELECT pk FROM %s ORDER BY val ann of [1.0, 2.0] > LIMIT 1"), row(0)); > assertRows(execute("SELECT pk FROM %s ORDER BY val ann of [1.0, 2.0] > LIMIT 2"), row(0), row(1)); > }); > // And now update both PKs so that the stream of ranked rows is PKs: 0, > 1, [1], 0, 1, [0], where the numbers > // wrapped in brackets are the "real" scores of the vectors. This test > makes sure that we correctly remove > // PrimaryKeys from the updatedKeys map so that we don't accidentally > duplicate PKs. > execute("INSERT INTO %s (pk, val) VALUES (1, [1.0, 3.5])"); > execute("INSERT INTO %s (pk, val) VALUES (0, [1.0, 6.0])"); > beforeAndAfterFlush(() -> { > assertRows(execute("SELECT pk FROM %s ORDER BY val ann of [1.0, 2.0] > LIMIT 1"), row(1)); > assertRows(execute("SELECT pk FROM %s ORDER BY val ann of [1.0, 2.0] > LIMIT 2"), row(1), row(0)); > }); > } {code} > Due to the correctness issues and the high probability of timeouts, as well > as the known quality of this change, I propose that this target the > cassandra-5.0 branch. -- This message was sent by Atlassian Jira (v8.20.10#820010) --------------------------------------------------------------------- To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org