[ 
https://issues.apache.org/jira/browse/CASSANDRA-8272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17066670#comment-17066670
 ] 

Andres de la Peña edited comment on CASSANDRA-8272 at 3/25/20, 1:36 PM:
------------------------------------------------------------------------

It seems that the previous approach based on index tombstones misses some 
cases: those where the replica with the most recent version of a column has 
never seen the previous versions of that column that might be in other 
replicas. For example:
{code:java}
CREATE TABLE t (k int PRIMARY KEY, v text);
CREATE INDEX ON t(v);
INSERT INTO t(k, v) VALUES (0, 'old') USING TIMESTAMP 1;  // Only node 1 gets it
INSERT INTO t(k, v) VALUES (0, 'new') USING TIMESTAMP 2;  // Only node 2 gets it
SELECT * FROM t WHERE v = 'old'; // node 1 returns a stale result!
{code}
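As a rough illustration of why the stale row comes back, here is a toy Python model of the scenario above. It is not Cassandra code: the dictionaries and the {{replica_index_query}} and {{coordinator_merge}} helpers are hypothetical names used only for this sketch, and it assumes a plain last-write-wins merge at the coordinator.

```python
# Toy model (not Cassandra code): each replica stores {key: (value, timestamp)}.
node1 = {0: ("old", 1)}  # only saw the write with TIMESTAMP 1
node2 = {0: ("new", 2)}  # only saw the write with TIMESTAMP 2


def replica_index_query(replica, wanted):
    """Each replica filters locally and returns only its matching rows."""
    return {k: cell for k, cell in replica.items() if cell[0] == wanted}


def coordinator_merge(responses):
    """Last-write-wins merge of whatever rows the replicas returned."""
    merged = {}
    for response in responses:
        for k, cell in response.items():
            if k not in merged or cell[1] > merged[k][1]:
                merged[k] = cell
    return merged


# Node 2 returns nothing for v = 'old', so the coordinator never sees the
# newer version and has nothing to reconcile the stale row against.
responses = [replica_index_query(n, "old") for n in (node1, node2)]
stale = coordinator_merge(responses)
print(stale)
```

Node 2 filters out its newer version before replying, so the merge only ever sees the row from node 1.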
The attached PR proposes a different approach that is similar to short read 
protection, and also fixes CASSANDRA-8273.

When there is replica-side protection, we materialize and cache the query 
results, using a merge listener to take note of the primary keys of rows for 
which some of the involved replicas didn't send a response. We know that those 
silent replicas might have a more recent version of the row that hasn't been 
included because it doesn't satisfy the filter. Once we have identified and 
collected those potentially stale rows, we ask the silent replicas for those 
rows, with {{SinglePartitionReadCommand}}s that don't use any filtering. 
Then, we complete the cached filtered results with the responses from the 
silent replicas, apply the row filter, and we are ready to go.
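
The steps above can be sketched with a small Python model. This is only a simplified illustration under stated assumptions, not the code in the PR: replicas are plain dictionaries, and {{filtered_read}}, {{unfiltered_read}} and {{protected_query}} are hypothetical names standing in for the filtered query, the unfiltered single-partition read and the coordinator logic.

```python
# Toy model (not Cassandra code): each replica stores {key: (value, timestamp)}.
node1 = {0: ("old", 1)}
node2 = {0: ("new", 2)}


def filtered_read(replica, wanted):
    """Replica-side filtering: only rows matching the filter are returned."""
    return {k: cell for k, cell in replica.items() if cell[0] == wanted}


def unfiltered_read(replica, key):
    """Stand-in for a single-partition read without any filtering."""
    return replica.get(key)


def protected_query(replicas, wanted):
    # 1. Materialize and cache the filtered responses.
    responses = [filtered_read(r, wanted) for r in replicas]
    result = {}
    # 2. Visit every key returned by at least one replica.
    for key in set().union(*responses):
        candidates = [resp[key] for resp in responses if key in resp]
        # 3. Ask the replicas that were silent for this key, without filtering.
        for replica, resp in zip(replicas, responses):
            if key not in resp:
                cell = unfiltered_read(replica, key)
                if cell is not None:
                    candidates.append(cell)
        # 4. Reconcile by timestamp, then re-apply the row filter.
        newest = max(candidates, key=lambda cell: cell[1])
        if newest[0] == wanted:
            result[key] = newest
    return result


print(protected_query([node1, node2], "old"))
print(protected_query([node1, node2], "new"))
```

In this model the query for 'old' no longer returns the stale row, because the unfiltered read from node 2 supplies the newer version before the filter is applied again.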

Another advantage of this approach over the previous one is that coordinators 
containing the fix can work with replicas that don't contain the fix.

A particular problem is that SASI results don't satisfy the requested row 
filter when an analyzer is used. This is something that we should fix so that 
the expressions can delegate their evaluation to the specific index 
implementation. I don't think this is especially problematic, but it should be 
done in a separate follow-up ticket. For now, the fix just skips replica 
filtering protection when SASI is used, keeping the old behaviour.

I'm attaching a PR for 3.11 and I'm working on the PR for trunk. The dtest PR 
is updated to include the new cases and queries using filtering instead of 
indexes.

Since this is a bug fix involving wrong query results, I think it would be 
great if we could ship it in 4.0.



> 2ndary indexes can return stale data
> ------------------------------------
>
>                 Key: CASSANDRA-8272
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8272
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Feature/2i Index
>            Reporter: Sylvain Lebresne
>            Assignee: Andres de la Peña
>            Priority: Normal
>              Labels: pull-request-available
>             Fix For: 3.0.x
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> When replicas return 2ndary index results, it's possible for a single replica 
> to return a stale result and that result will be sent back to the user, 
> potentially failing the CL contract.
> For instance, consider 3 replicas A, B and C, and the following situation:
> {noformat}
> CREATE TABLE test (k int PRIMARY KEY, v text);
> CREATE INDEX ON test(v);
> INSERT INTO test(k, v) VALUES (0, 'foo');
> {noformat}
> with every replica up to date. Now, suppose that the following queries are 
> done at {{QUORUM}}:
> {noformat}
> UPDATE test SET v = 'bar' WHERE k = 0;
> SELECT * FROM test WHERE v = 'foo';
> {noformat}
> then, if A and B acknowledge the update but C responds to the read before 
> having applied the update, the now stale result will be returned (since 
> C will return it and A or B will return nothing).
> A potential solution would be that when we read a tombstone in the index (and 
> provided we make the index inherit the gcGrace of its parent CF), instead of 
> skipping that tombstone, we'd insert in the result a corresponding range 
> tombstone.


