[ 
https://issues.apache.org/jira/browse/CASSANDRA-8272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16007899#comment-16007899
 ] 

Sylvain Lebresne commented on CASSANDRA-8272:
---------------------------------------------

I hadn't really looked at the patch before, but I think we can and should make 
most things index agnostic here. In particular, I don't think we should use 
{{Index#postProcessorFor}} for the coordinator-side filtering. What I would do 
instead is modify {{PartitionRangeReadCommand#ReconcialiationProcessing}} to be:
{noformat}
public PartitionIterator postReconciliationProcessing(PartitionIterator result)
{
    ColumnFamilyStore cfs = 
Keyspace.open(metadata().keyspace).getColumnFamilyStore(metadata().name);
    Index index = getIndex(cfs);
    if (index == null)
        return result;

    // Indexes on replica can return results that don't match the query but are 
necessary
    // to avoid stale entries from other nodes (see #8272) so we should filter 
those out
    // now. Then we apply any index specific post-processor.
    RowFilter indexFilter = index.getIndexQueryFilter(rowFilter());
    return index.postProcessorFor(this)
                .apply(indexFilter.filter(result, metadata(), nowInSec()), 
this);
}
{noformat}
where we'd just need to add a {{Index#getIndexQueryFilter}} that would be the 
exact inverse of {{Index#getPostIndexQueryFilter}} (and so this can be simply 
implemented as a default method that subtract from its input anything from 
{{Index#getPostIndexQueryFilter}}).

The reason I'm advocating this is two-fold:
# it's simpler in that it's done once generically instead of being 
re-implemented by each index implementation; and as index already expose 
everything we need to do this, no reason to have them do it "manually".
# probably more importantly, if we do decide to move all filtering 
coordinator-side in CASSANDRA-8273 (which I do believe is the right thing to 
do), we'd just have to modify (and actually simplify) that method slightly to
{noformat}
public PartitionIterator postReconciliationProcessing(PartitionIterator result)
{
    result = rowFilter().filter(result, metadata(), nowInSec());
    ColumnFamilyStore cfs = 
Keyspace.open(metadata().keyspace).getColumnFamilyStore(metadata().name);
    Index index = getIndex(cfs);
    if (index == null) ? result : index.postProcessorFor(this).apply(result, 
this);
}
{noformat}
which is imo kind of neat conceptually.

It is true that doing so does mean we need indexes to implement 
{{CustomExpression#isSatisfiedBy}}, which as I mention in my previous comment 
requires a slight refactor, but I'd argue that this is something we absolutely 
should do anyway (and should have done before). It's wrong that this method is 
currently basically broken for custom expressions: sure we currently happen to 
not use it in that case, but it's not future-proof at all and it's that kind of 
tribal knowledge ("you should make sure to not use {{Expression#isSatisfiedBy}} 
or anything that uses it if there is {{CustomExpression}} involved") that makes 
the code hard to work with.

bq. could be provided by a new method {{Index#getPostIndexQueryLimits}}, 
similar to the existing {{Index#getPostIndexQueryFilter}}.

That'd work, though we really don't need indexes to implement it, we can use 
the {{Index#getIndexQueryFilter}} from above. And so maybe we don't need to add 
a new method to {{Index}}, and instead just add a 
{{DataLimits#withFiltering(RowFilter)}} method (that given a {{DataLimits}}, 
create a new one that only count rows that match the provided {{RowFilter}}) 
and directly use that with the result of {{Index#getIndexQueryFilter}} when we 
need it.

bq. Alternatively, we might just disable the limits at 
{{ReadCommand#executeLocally}} and {{DataResolver#resolve}} for index queries, 
and let the indexes take care of restricting the limits at search time

I'll admit I'm kind of opposed to that. There is always cases where we need to 
be able to count stuffs post-query so not being able to do so with index 
queries would be clunky, if not properly a blocker. For instance, I suspect 
it'll break paging without quite a bit of special casing. Another example would 
be the row cache, where we do some stuffs around counting that just wouldn't 
work in that case (don't get me wrong, we don't use row cache for 2i today, I'm 
just trying to illustrate that we'd lose flexibility for future developments).

Further, it took some care in CASSANDRA-8099 to cleanly separate the counting 
of results from the actual producer of data but that imo simplified and cleaned 
thinks up, and I'd hate to start breaking that kind of "abstraction".


> 2ndary indexes can return stale data
> ------------------------------------
>
>                 Key: CASSANDRA-8272
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8272
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Sylvain Lebresne
>            Assignee: Andrés de la Peña
>             Fix For: 3.0.x
>
>
> When replica return 2ndary index results, it's possible for a single replica 
> to return a stale result and that result will be sent back to the user, 
> potentially failing the CL contract.
> For instance, consider 3 replicas A, B and C, and the following situation:
> {noformat}
> CREATE TABLE test (k int PRIMARY KEY, v text);
> CREATE INDEX ON test(v);
> INSERT INTO test(k, v) VALUES (0, 'foo');
> {noformat}
> with every replica up to date. Now, suppose that the following queries are 
> done at {{QUORUM}}:
> {noformat}
> UPDATE test SET v = 'bar' WHERE k = 0;
> SELECT * FROM test WHERE v = 'foo';
> {noformat}
> then, if A and B acknowledge the insert but C respond to the read before 
> having applied the insert, then the now stale result will be returned (since 
> C will return it and A or B will return nothing).
> A potential solution would be that when we read a tombstone in the index (and 
> provided we make the index inherit the gcGrace of it's parent CF), instead of 
> skipping that tombstone, we'd insert in the result a corresponding range 
> tombstone.  



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to