[
https://issues.apache.org/jira/browse/CASSANDRA-20086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18039661#comment-18039661
]
Michael Marshall commented on CASSANDRA-20086:
----------------------------------------------
I pushed a fix for the remaining tests that are failing, and also merged
`cassandra-5.0` into my branch. I just triggered another CI run:
http://ci-cassandra.infra.datastax.com/job/cassandra/54/
[~maedhroz] - this is ready for review
> Use similarity score ordered iterators in SAI vector search; fix handling of
> updated vectors
> --------------------------------------------------------------------------------------------
>
> Key: CASSANDRA-20086
> URL: https://issues.apache.org/jira/browse/CASSANDRA-20086
> Project: Apache Cassandra
> Issue Type: Bug
> Components: Feature/Vector Search
> Reporter: Michael Marshall
> Assignee: Michael Marshall
> Priority: Urgent
> Fix For: 5.0.x, 5.x
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> The current vector search implementation has the following execution path
> within SAI:
> # Get topk vectors in score order (desc) from each vector index.
> # Sort in Primary Key order
> # Pull all results into the VectorTopKProcessor, sort/truncate to queried
> LIMIT
> # StorageAttachedIndexSearcher checks if there were any shadowed rows, and
> if there are, re-runs steps 1 through 3 until there are no additional
> shadowed rows.
> There are several problems with this approach.
> # It does not properly handle updated vectors, and can return low scoring
> vectors.
> # It materializes `limit * num segment indexes` rows on the happy path.
> # It is very sensitive to a single overwrite/tombstone and leads to timeouts
> in many simple use cases. This sensitivity is even worse for hybrid queries
> because the re-query logic forces us to search the boolean predicates again
> too.
> I propose that we modify the execution path within SAI and decouple the SAI
> vector search from other search types. We can return iterators out of the
> vector index that are in score order, not Primary Key order. In doing so, we
> can create merge-able iterators that will give us a single score ordered
> iterator from which we can materialize one row at a time in the
> VectorTopKProcessor.
> The solution removes the sensitivity to tombstones and properly handles
> updated vectors by making sure that the similarity score of the materialized
> vector matches the similarity score computed within the index.
> I implemented this in the DataStax fork here
> [https://github.com/datastax/cassandra/pull/929|https://github.com/datastax/cassandra/pull/929].
> The PR has quite a bit of detail and has been run in production for a while
> now, so it is hardened code. At that time, we saw hybrid query p99 latencies
> for some production workloads go from 2 seconds to 40 milliseconds, and for
> other workloads, we saw queries go from timing out to 200 milliseconds.
> I have continued to develop on the design, so the work that I am proposing
> bringing in to SAI will be a bit of an updated version of 929.
> I put together some mermaid diagrams to show how the patch affects query
> execution within SAI/vector:
> https://github.com/michaeljmarshall/cassandra/blob/vsearch/src/java/org/apache/cassandra/index/sai/VECTOR.md
> Here is a sample of a test that does not work for the current vector
> implementation but will pass with my patch.
> {code:java}
> @Test
> public void testUpdateVectorToWorseAndBetterPositions() throws Throwable
> {
> createTable("CREATE TABLE %s (pk int, val vector<float, 2>, PRIMARY
> KEY(pk))");
> createIndex("CREATE CUSTOM INDEX ON %s(val) USING
> 'StorageAttachedIndex'");
> execute("INSERT INTO %s (pk, val) VALUES (0, [1.0, 2.0])");
> execute("INSERT INTO %s (pk, val) VALUES (1, [1.0, 3.0])");
> flush();
> execute("INSERT INTO %s (pk, val) VALUES (0, [1.0, 4.0])");
> beforeAndAfterFlush(() -> {
> assertRows(execute("SELECT pk FROM %s ORDER BY val ann of [1.0, 2.0]
> LIMIT 1"), row(1));
> assertRows(execute("SELECT pk FROM %s ORDER BY val ann of [1.0, 2.0]
> LIMIT 2"), row(1), row(0));
> });
> // And now update pk 1 to show that we can get 0 too
> execute("INSERT INTO %s (pk, val) VALUES (1, [1.0, 5.0])");
> beforeAndAfterFlush(() -> {
> assertRows(execute("SELECT pk FROM %s ORDER BY val ann of [1.0, 2.0]
> LIMIT 1"), row(0));
> assertRows(execute("SELECT pk FROM %s ORDER BY val ann of [1.0, 2.0]
> LIMIT 2"), row(0), row(1));
> });
> // And now update both PKs so that the stream of ranked rows is PKs: 0,
> 1, [1], 0, 1, [0], where the numbers
> // wrapped in brackets are the "real" scores of the vectors. This test
> makes sure that we correctly remove
> // PrimaryKeys from the updatedKeys map so that we don't accidentally
> duplicate PKs.
> execute("INSERT INTO %s (pk, val) VALUES (1, [1.0, 3.5])");
> execute("INSERT INTO %s (pk, val) VALUES (0, [1.0, 6.0])");
> beforeAndAfterFlush(() -> {
> assertRows(execute("SELECT pk FROM %s ORDER BY val ann of [1.0, 2.0]
> LIMIT 1"), row(1));
> assertRows(execute("SELECT pk FROM %s ORDER BY val ann of [1.0, 2.0]
> LIMIT 2"), row(1), row(0));
> });
> } {code}
> Due to the correctness issues and the high probability of timeouts, as well
> as the known quality of this change, I propose that this target the
> cassandra-5.0 branch.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]