Michael Marshall created CASSANDRA-20086:
--------------------------------------------

             Summary: Fix handling of updated vectors
                 Key: CASSANDRA-20086
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-20086
             Project: Cassandra
          Issue Type: Bug
          Components: Feature/Vector Search
            Reporter: Michael Marshall
            Assignee: Michael Marshall


The current vector search implementation has the following execution path within SAI (sketched after this list):
 # Get the top-k vectors in score order (descending) from each vector index.
 # Sort the results in Primary Key order.
 # Pull all results into the VectorTopKProcessor and sort/truncate to the queried LIMIT.
 # StorageAttachedIndexSearcher checks whether any of those rows were shadowed, and if so, re-runs steps 1 through 3 until there are no additional shadowed rows.
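
For illustration, here is a minimal sketch of one pass of that path. The types below ({{ScoredKey}}, {{SegmentIndex}}) are hypothetical stand-ins rather than the actual SAI classes; the point is only how much gets materialized up front and how a shadowed row forces a full re-run.
{code:java}
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;
import java.util.Set;

class CurrentPathSketch
{
    record ScoredKey(int primaryKey, float score) {}

    // Each segment index returns its top-k primary keys in descending score order.
    interface SegmentIndex { List<ScoredKey> topK(float[] query, int k); }

    // Returns empty when a shadowed row was hit, in which case the caller re-runs the whole pass.
    static Optional<List<Integer>> singlePass(List<SegmentIndex> segments, float[] query,
                                              int limit, Set<Integer> shadowed)
    {
        // Step 1: top-k from every segment, so limit * num segment indexes rows are materialized.
        List<ScoredKey> candidates = new ArrayList<>();
        for (SegmentIndex segment : segments)
            candidates.addAll(segment.topK(query, limit));

        // Step 2: sort into Primary Key order, as the rest of the read path expects.
        candidates.sort(Comparator.comparingInt(ScoredKey::primaryKey));

        // Step 3: VectorTopKProcessor-style re-sort by score and truncation to the queried LIMIT.
        candidates.sort(Comparator.comparingDouble(ScoredKey::score).reversed());
        List<ScoredKey> top = candidates.subList(0, Math.min(limit, candidates.size()));

        // Step 4: any shadowed row forces steps 1 through 3 to be re-run (and, for hybrid
        // queries, the boolean predicates to be searched again too).
        if (top.stream().anyMatch(s -> shadowed.contains(s.primaryKey())))
            return Optional.empty();
        return Optional.of(top.stream().map(ScoredKey::primaryKey).toList());
    }
}
{code}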

There are several problems with this approach:
 # It does not properly handle updated vectors and can return low-scoring vectors.
 # It materializes {{limit * num segment indexes}} rows on the happy path.
 # It is very sensitive to a single overwrite/tombstone, which leads to timeouts in many simple use cases. The sensitivity is even worse for hybrid queries because the re-query logic forces us to search the boolean predicates again as well.

I propose that we modify the execution path within SAI and decouple SAI vector search from the other search types. We can return iterators from the vector index that are in score order rather than Primary Key order. In doing so, we can create mergeable iterators that give us a single score-ordered iterator, from which we can materialize one row at a time in the VectorTopKProcessor.
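
As a rough illustration of what such mergeable, score-ordered iterators could look like (a hypothetical {{ScoredKey}} type and a plain priority-queue merge, not the actual SAI classes):
{code:java}
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

class ScoreOrderedMergeSketch
{
    record ScoredKey(int primaryKey, float score) {}

    private record Head(ScoredKey value, Iterator<ScoredKey> rest) {}

    /**
     * Lazily merges iterators that are each in descending score order into a single
     * descending-score iterator, so rows can be materialized one at a time up to LIMIT.
     */
    static Iterator<ScoredKey> merge(List<Iterator<ScoredKey>> perSegment)
    {
        PriorityQueue<Head> heap =
            new PriorityQueue<>((a, b) -> Float.compare(b.value().score(), a.value().score()));
        for (Iterator<ScoredKey> it : perSegment)
            if (it.hasNext())
                heap.add(new Head(it.next(), it));

        return new Iterator<>()
        {
            public boolean hasNext() { return !heap.isEmpty(); }

            public ScoredKey next()
            {
                Head top = heap.poll();
                if (top.rest().hasNext())
                    heap.add(new Head(top.rest().next(), top.rest()));
                return top.value();
            }
        };
    }
}
{code}
With a single score-ordered stream, the consumer can stop as soon as it has materialized LIMIT live rows instead of pulling everything in up front.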

The solution removes the sensitivity to tombstones and properly handles updated 
vectors by making sure that the similarity score of the materialized vector 
matches the similarity score computed within the index.
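
Sketched very roughly (hypothetical names, not the actual implementation): when a row is materialized, its stored vector is re-scored against the query, and if the real score is lower than the score the index produced (i.e. the vector was updated), the key is pushed back into the ordering at its real score rather than emitted at the stale rank.
{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

class UpdatedVectorScoreCheckSketch
{
    record Candidate(int primaryKey, float[] storedVector, float score) {}

    interface Similarity { float score(float[] query, float[] vector); }

    // byScoreDesc is assumed to be ordered by descending score; deduplication of keys is elided.
    static List<Integer> topK(PriorityQueue<Candidate> byScoreDesc, float[] query,
                              Similarity similarity, int limit)
    {
        List<Integer> result = new ArrayList<>();
        while (result.size() < limit && !byScoreDesc.isEmpty())
        {
            Candidate candidate = byScoreDesc.poll();
            float actual = similarity.score(query, candidate.storedVector());
            if (actual >= candidate.score())
            {
                // The materialized vector scores at least as well as the index said, so it
                // genuinely belongs at this rank and can be emitted.
                result.add(candidate.primaryKey());
            }
            else
            {
                // The vector was updated to a worse position after it was indexed: re-enqueue it
                // at its real score so it competes at the correct rank instead of here.
                byScoreDesc.add(new Candidate(candidate.primaryKey(), candidate.storedVector(), actual));
            }
        }
        return result;
    }
}
{code}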

I implemented this in the DataStax fork here: [https://github.com/datastax/cassandra/pull/929].
 The PR has quite a bit of detail and has been run in production for a while 
now, so it is hardened code. At that time, we saw hybrid query p99 latencies 
for some production workloads go from 2 seconds to 40 milliseconds, and for 
other workloads, we saw queries go from timing out to 200 milliseconds.

I have continued to develop the design, so the work I am proposing to bring into SAI will be a slightly updated version of 929.

The general design is documented in this readme: [https://github.com/datastax/cassandra/blob/main/src/java/org/apache/cassandra/index/sai/VECTOR.md].

Here is a sample test that fails with the current vector implementation but passes with my patch.
{code:java}
@Test
public void testUpdateVectorToWorseAndBetterPositions() throws Throwable
{
    createTable("CREATE TABLE %s (pk int, val vector<float, 2>, PRIMARY 
KEY(pk))");
    createIndex("CREATE CUSTOM INDEX ON %s(val) USING 'StorageAttachedIndex'");

    execute("INSERT INTO %s (pk, val) VALUES (0, [1.0, 2.0])");
    execute("INSERT INTO %s (pk, val) VALUES (1, [1.0, 3.0])");

    flush();
    execute("INSERT INTO %s (pk, val) VALUES (0, [1.0, 4.0])");

    beforeAndAfterFlush(() -> {
        assertRows(execute("SELECT pk FROM %s ORDER BY val ann of [1.0, 2.0] 
LIMIT 1"), row(1));
        assertRows(execute("SELECT pk FROM %s ORDER BY val ann of [1.0, 2.0] 
LIMIT 2"), row(1), row(0));
    });

    // And now update pk 1 to show that we can get 0 too
    execute("INSERT INTO %s (pk, val) VALUES (1, [1.0, 5.0])");

    beforeAndAfterFlush(() -> {
        assertRows(execute("SELECT pk FROM %s ORDER BY val ann of [1.0, 2.0] 
LIMIT 1"), row(0));
        assertRows(execute("SELECT pk FROM %s ORDER BY val ann of [1.0, 2.0] 
LIMIT 2"), row(0), row(1));
    });

    // And now update both PKs so that the stream of ranked rows is PKs: 0, 1, [1], 0, 1, [0],
    // where the numbers wrapped in brackets are the "real" scores of the vectors. This test
    // makes sure that we correctly remove PrimaryKeys from the updatedKeys map so that we
    // don't accidentally duplicate PKs.
    execute("INSERT INTO %s (pk, val) VALUES (1, [1.0, 3.5])");
    execute("INSERT INTO %s (pk, val) VALUES (0, [1.0, 6.0])");

    beforeAndAfterFlush(() -> {
        assertRows(execute("SELECT pk FROM %s ORDER BY val ann of [1.0, 2.0] 
LIMIT 1"), row(1));
        assertRows(execute("SELECT pk FROM %s ORDER BY val ann of [1.0, 2.0] 
LIMIT 2"), row(1), row(0));
    });
}
{code}
Due to the correctness issues and the high probability of timeouts, as well as the fact that this change is already hardened in production, I propose that it target the cassandra-5.0 branch.


