[ 
https://issues.apache.org/jira/browse/CASSANDRA-20086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Marshall updated CASSANDRA-20086:
-----------------------------------------
    Description: 
The current vector search implementation has the following execution path 
within SAI:
 # Get topk vectors in score order (desc) from each vector index.
 # Sort in Primary Key order
 # Pull all results into the VectorTopKProcessor, sort/truncate to queried LIMIT
 # StorageAttachedIndexSearcher checks if there were any shadowed rows, and if 
there are, re-runs steps 1 through 3 until there are no additional shadowed 
rows.

There are several problems with this approach.
 # It does not properly handle updated vectors, and can return low scoring 
vectors.
 # It materializes `limit * num segment indexes` rows on the happy path.
 # It is very sensitive to a single overwrite/tombstone and leads to timeouts 
in many simple use cases. This sensitivity is even worse for hybrid queries 
because the re-query logic forces us to search the boolean predicates again too.

I propose that we modify the execution path within SAI and decouple the SAI 
vector search from other search types. We can return iterators out of the 
vector index that are in score order, not Primary Key order. In doing so, we 
can create merge-able iterators that will give us a single score ordered 
iterator from which we can materialize one row at a time in the 
VectorTopKProcessor.

The solution removes the sensitivity to tombstones and properly handles updated 
vectors by making sure that the similarity score of the materialized vector 
matches the similarity score computed within the index.

I implemented this in the DataStax fork here 
[https://github.com/datastax/cassandra/pull/929|https://github.com/datastax/cassandra/pull/929.].
 The PR has quite a bit of detail and has been run in production for a while 
now, so it is hardened code. At that time, we saw hybrid query p99 latencies 
for some production workloads go from 2 seconds to 40 milliseconds, and for 
other workloads, we saw queries go from timing out to 200 milliseconds.

I have continued to develop on the design, so the work that I am proposing 
bringing in to SAI will be a bit of an updated version of 929.

I put together some mermaid diagrams to show how the patch affects query 
execution within SAI/vector: 
https://github.com/michaeljmarshall/cassandra/blob/vsearch/src/java/org/apache/cassandra/index/sai/VECTOR.md

Here is a sample of a test that does not work for the current vector 
implementation but will pass with my patch.
{code:java}
@Test
public void testUpdateVectorToWorseAndBetterPositions() throws Throwable
{
    createTable("CREATE TABLE %s (pk int, val vector<float, 2>, PRIMARY 
KEY(pk))");
    createIndex("CREATE CUSTOM INDEX ON %s(val) USING 'StorageAttachedIndex'");

    execute("INSERT INTO %s (pk, val) VALUES (0, [1.0, 2.0])");
    execute("INSERT INTO %s (pk, val) VALUES (1, [1.0, 3.0])");

    flush();
    execute("INSERT INTO %s (pk, val) VALUES (0, [1.0, 4.0])");

    beforeAndAfterFlush(() -> {
        assertRows(execute("SELECT pk FROM %s ORDER BY val ann of [1.0, 2.0] 
LIMIT 1"), row(1));
        assertRows(execute("SELECT pk FROM %s ORDER BY val ann of [1.0, 2.0] 
LIMIT 2"), row(1), row(0));
    });

    // And now update pk 1 to show that we can get 0 too
    execute("INSERT INTO %s (pk, val) VALUES (1, [1.0, 5.0])");

    beforeAndAfterFlush(() -> {
        assertRows(execute("SELECT pk FROM %s ORDER BY val ann of [1.0, 2.0] 
LIMIT 1"), row(0));
        assertRows(execute("SELECT pk FROM %s ORDER BY val ann of [1.0, 2.0] 
LIMIT 2"), row(0), row(1));
    });

    // And now update both PKs so that the stream of ranked rows is PKs: 0, 1, 
[1], 0, 1, [0], where the numbers
    // wrapped in brackets are the "real" scores of the vectors. This test 
makes sure that we correctly remove
    // PrimaryKeys from the updatedKeys map so that we don't accidentally 
duplicate PKs.
    execute("INSERT INTO %s (pk, val) VALUES (1, [1.0, 3.5])");
    execute("INSERT INTO %s (pk, val) VALUES (0, [1.0, 6.0])");

    beforeAndAfterFlush(() -> {
        assertRows(execute("SELECT pk FROM %s ORDER BY val ann of [1.0, 2.0] 
LIMIT 1"), row(1));
        assertRows(execute("SELECT pk FROM %s ORDER BY val ann of [1.0, 2.0] 
LIMIT 2"), row(1), row(0));
    });
} {code}
Due to the correctness issues and the high probability of timeouts, as well as 
the known quality of this change, I propose that this target the cassandra-5.0 
branch.

  was:
The current vector search implementation has the following execution path 
within SAI:
 # Get topk vectors in score order (desc) from each vector index.
 # Sort in Primary Key order
 # Pull all results into the VectorTopKProcessor, sort/truncate to queried LIMIT
 # StorageAttachedIndexSearcher checks if there were any shadowed rows, and if 
there are, re-runs steps 1 through 3 until there are no additional shadowed 
rows.

There are several problems with this approach.
 # It does not properly handle updated vectors, and can return low scoring 
vectors.
 # It materializes `limit * num segment indexes` rows on the happy path.
 # It is very sensitive to a single overwrite/tombstone and leads to timeouts 
in many simple use cases. This sensitivity is even worse for hybrid queries 
because the re-query logic forces us to search the boolean predicates again too.

I propose that we modify the execution path within SAI and decouple the SAI 
vector search from other search types. We can return iterators out of the 
vector index that are in score order, not Primary Key order. In doing so, we 
can create merge-able iterators that will give us a single score ordered 
iterator from which we can materialize one row at a time in the 
VectorTopKProcessor.

The solution removes the sensitivity to tombstones and properly handles updated 
vectors by making sure that the similarity score of the materialized vector 
matches the similarity score computed within the index.

I implemented this in the DataStax fork here 
[https://github.com/datastax/cassandra/pull/929|https://github.com/datastax/cassandra/pull/929.].
 The PR has quite a bit of detail and has been run in production for a while 
now, so it is hardened code. At that time, we saw hybrid query p99 latencies 
for some production workloads go from 2 seconds to 40 milliseconds, and for 
other workloads, we saw queries go from timing out to 200 milliseconds.

I have continued to develop on the design, so the work that I am proposing 
bringing in to SAI will be a bit of an updated version of 929.

I put together some mermaid diagrams to show how the patch affects query 
execution within SAI/vector: 
[https://github.com/datastax/cassandra/blob/main/src/java/org/apache/cassandra/index/sai/VECTOR.md|https://github.com/datastax/cassandra/blob/main/src/java/org/apache/cassandra/index/sai/VECTOR.md.].

Here is a sample of a test that does not work for the current vector 
implementation but will pass with my patch.
{code:java}
@Test
public void testUpdateVectorToWorseAndBetterPositions() throws Throwable
{
    createTable("CREATE TABLE %s (pk int, val vector<float, 2>, PRIMARY 
KEY(pk))");
    createIndex("CREATE CUSTOM INDEX ON %s(val) USING 'StorageAttachedIndex'");

    execute("INSERT INTO %s (pk, val) VALUES (0, [1.0, 2.0])");
    execute("INSERT INTO %s (pk, val) VALUES (1, [1.0, 3.0])");

    flush();
    execute("INSERT INTO %s (pk, val) VALUES (0, [1.0, 4.0])");

    beforeAndAfterFlush(() -> {
        assertRows(execute("SELECT pk FROM %s ORDER BY val ann of [1.0, 2.0] 
LIMIT 1"), row(1));
        assertRows(execute("SELECT pk FROM %s ORDER BY val ann of [1.0, 2.0] 
LIMIT 2"), row(1), row(0));
    });

    // And now update pk 1 to show that we can get 0 too
    execute("INSERT INTO %s (pk, val) VALUES (1, [1.0, 5.0])");

    beforeAndAfterFlush(() -> {
        assertRows(execute("SELECT pk FROM %s ORDER BY val ann of [1.0, 2.0] 
LIMIT 1"), row(0));
        assertRows(execute("SELECT pk FROM %s ORDER BY val ann of [1.0, 2.0] 
LIMIT 2"), row(0), row(1));
    });

    // And now update both PKs so that the stream of ranked rows is PKs: 0, 1, 
[1], 0, 1, [0], where the numbers
    // wrapped in brackets are the "real" scores of the vectors. This test 
makes sure that we correctly remove
    // PrimaryKeys from the updatedKeys map so that we don't accidentally 
duplicate PKs.
    execute("INSERT INTO %s (pk, val) VALUES (1, [1.0, 3.5])");
    execute("INSERT INTO %s (pk, val) VALUES (0, [1.0, 6.0])");

    beforeAndAfterFlush(() -> {
        assertRows(execute("SELECT pk FROM %s ORDER BY val ann of [1.0, 2.0] 
LIMIT 1"), row(1));
        assertRows(execute("SELECT pk FROM %s ORDER BY val ann of [1.0, 2.0] 
LIMIT 2"), row(1), row(0));
    });
} {code}
Due to the correctness issues and the high probability of timeouts, as well as 
the known quality of this change, I propose that this target the cassandra-5.0 
branch.


> Use similarity score ordered iterators in SAI vector search; fix handling of 
> updated vectors
> --------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-20086
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-20086
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Feature/Vector Search
>            Reporter: Michael Marshall
>            Assignee: Michael Marshall
>            Priority: Normal
>             Fix For: 5.0.x, 5.x
>
>
> The current vector search implementation has the following execution path 
> within SAI:
>  # Get topk vectors in score order (desc) from each vector index.
>  # Sort in Primary Key order
>  # Pull all results into the VectorTopKProcessor, sort/truncate to queried 
> LIMIT
>  # StorageAttachedIndexSearcher checks if there were any shadowed rows, and 
> if there are, re-runs steps 1 through 3 until there are no additional 
> shadowed rows.
> There are several problems with this approach.
>  # It does not properly handle updated vectors, and can return low scoring 
> vectors.
>  # It materializes `limit * num segment indexes` rows on the happy path.
>  # It is very sensitive to a single overwrite/tombstone and leads to timeouts 
> in many simple use cases. This sensitivity is even worse for hybrid queries 
> because the re-query logic forces us to search the boolean predicates again 
> too.
> I propose that we modify the execution path within SAI and decouple the SAI 
> vector search from other search types. We can return iterators out of the 
> vector index that are in score order, not Primary Key order. In doing so, we 
> can create merge-able iterators that will give us a single score ordered 
> iterator from which we can materialize one row at a time in the 
> VectorTopKProcessor.
> The solution removes the sensitivity to tombstones and properly handles 
> updated vectors by making sure that the similarity score of the materialized 
> vector matches the similarity score computed within the index.
> I implemented this in the DataStax fork here 
> [https://github.com/datastax/cassandra/pull/929|https://github.com/datastax/cassandra/pull/929.].
>  The PR has quite a bit of detail and has been run in production for a while 
> now, so it is hardened code. At that time, we saw hybrid query p99 latencies 
> for some production workloads go from 2 seconds to 40 milliseconds, and for 
> other workloads, we saw queries go from timing out to 200 milliseconds.
> I have continued to develop on the design, so the work that I am proposing 
> bringing in to SAI will be a bit of an updated version of 929.
> I put together some mermaid diagrams to show how the patch affects query 
> execution within SAI/vector: 
> https://github.com/michaeljmarshall/cassandra/blob/vsearch/src/java/org/apache/cassandra/index/sai/VECTOR.md
> Here is a sample of a test that does not work for the current vector 
> implementation but will pass with my patch.
> {code:java}
> @Test
> public void testUpdateVectorToWorseAndBetterPositions() throws Throwable
> {
>     createTable("CREATE TABLE %s (pk int, val vector<float, 2>, PRIMARY 
> KEY(pk))");
>     createIndex("CREATE CUSTOM INDEX ON %s(val) USING 
> 'StorageAttachedIndex'");
>     execute("INSERT INTO %s (pk, val) VALUES (0, [1.0, 2.0])");
>     execute("INSERT INTO %s (pk, val) VALUES (1, [1.0, 3.0])");
>     flush();
>     execute("INSERT INTO %s (pk, val) VALUES (0, [1.0, 4.0])");
>     beforeAndAfterFlush(() -> {
>         assertRows(execute("SELECT pk FROM %s ORDER BY val ann of [1.0, 2.0] 
> LIMIT 1"), row(1));
>         assertRows(execute("SELECT pk FROM %s ORDER BY val ann of [1.0, 2.0] 
> LIMIT 2"), row(1), row(0));
>     });
>     // And now update pk 1 to show that we can get 0 too
>     execute("INSERT INTO %s (pk, val) VALUES (1, [1.0, 5.0])");
>     beforeAndAfterFlush(() -> {
>         assertRows(execute("SELECT pk FROM %s ORDER BY val ann of [1.0, 2.0] 
> LIMIT 1"), row(0));
>         assertRows(execute("SELECT pk FROM %s ORDER BY val ann of [1.0, 2.0] 
> LIMIT 2"), row(0), row(1));
>     });
>     // And now update both PKs so that the stream of ranked rows is PKs: 0, 
> 1, [1], 0, 1, [0], where the numbers
>     // wrapped in brackets are the "real" scores of the vectors. This test 
> makes sure that we correctly remove
>     // PrimaryKeys from the updatedKeys map so that we don't accidentally 
> duplicate PKs.
>     execute("INSERT INTO %s (pk, val) VALUES (1, [1.0, 3.5])");
>     execute("INSERT INTO %s (pk, val) VALUES (0, [1.0, 6.0])");
>     beforeAndAfterFlush(() -> {
>         assertRows(execute("SELECT pk FROM %s ORDER BY val ann of [1.0, 2.0] 
> LIMIT 1"), row(1));
>         assertRows(execute("SELECT pk FROM %s ORDER BY val ann of [1.0, 2.0] 
> LIMIT 2"), row(1), row(0));
>     });
> } {code}
> Due to the correctness issues and the high probability of timeouts, as well 
> as the known quality of this change, I propose that this target the 
> cassandra-5.0 branch.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org

Reply via email to