[jira] [Commented] (CASSANDRA-8272) 2ndary indexes can return stale data

Sylvain Lebresne (JIRA) Thu, 11 May 2017 06:35:27 -0700

    [ 
https://issues.apache.org/jira/browse/CASSANDRA-8272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16006428#comment-16006428
 ]


Sylvain Lebresne commented on CASSANDRA-8272:
---------------------------------------------

bq. 1) Stick with the current approach! It's good and I do not think using 
special tombstones would buy us anything.

It would solve the {{LIMIT}} issue for instance. It's also theoretically a bit 
more efficient as won't ship full rows to the coordinator that it end up 
discarding.

Not that I'm suggesting we use special tombstones: thinking about it more I 
think it's actually broken in some cases where we delete/re-insert an entry in 
rapid succession. Basically, the deletion could end up deleting a valid entry 
post-re-insert if one of the node haven't seen that re-insert yet.

Besides, the {{LIMIT}} issue isn't that hard to fix and the performance impact 
is unlikely big in most cases. But my main point is that on principle we should 
be careful to look at the whole solution before comparing it to alternatives 
and deciding which one we "stick with". I've seen simple solutions get pretty 
messy once you fix all edge cases to the point that it wasn't the best solution 
anymore.

bq. 3) Generalize the approach so we can fix filtering

I wouldn't make that a requirement to commit this ticket. Filtering is 
genuinely orthogonal to indexing: it's as valid to use with and without 
indexing, and the code for both is mainly orthogonal. It's in particular not 
true that fixing this bug will be "invalidated when filtering is applied", 
especially when 2i is used but filtering isn't (hopefully the most common case 
in production since filtering is what it is).

Don't get wrong, both problems are certainly related, in that the underlying 
problem is that replica can return stale data where up-to-date replica don't 
return anything to indicate this is stale.

But in the indexing cases, we have more information in that we have the index 
tombstones to know what entries might be stale on another node, so we know *a 
relatively minimal* set of info (rows) to return to "fix" potentially stale 
entries from other replica.

In the filtering case, we don't have similar information to help\[1\]: the 
equivalent to the solution from [~adelapena]'s patch is to say that when a 
replica sees a row they should filter out, they still return it in case it may 
be needed to "fix" a stale replica. But doing so exactly amounts to not doing 
any filtering replica-side and moving it all server-side, which is what I'm 
suggesting in CASSANDRA-8273.

In other words, this ticket has a mostly-replica-side based solution but the 
filtering one probably doesn't. That's enough difference imo to not wed 
ourselves to fixing both problems in the same ticket, and to keep discussion 
around filtering on CASSANDRA-8273 (doesn't mean we can't cross-reference of 
course when appropriate).

bq. and any other indexing implementation (most notably SASI)

That I agree is something we should consider. Though tbh, I have doubts we can 
have a solution that is completely index agnostic. What we need is that indexes 
return any entries that _was_ recently valid but isn't anymore (up to gc_grace) 
so that the result from any replica having the old entry can be properly 
reconciled and skipped. That has to be something the index implementation 
itself provides to us, and the best we can do is simply specify that index 
implementations have to do that to be correct. Though I suspect not all custom 
index implementations will be able to provided that at all in practice.

Don't know about SASI in particular though. I assume it kind of has to keep 
tombstones for old entries in the first place (not sure how it handles deletes 
otherwise) and if so, we should certainly update it to implement the new 
requirement described above.

There is the question of the {{LIMIT}} problem. I think the proper way to fix 
that is to create a new {{DataLimits}} that keep the {{RowFilter}} around and 
that doesn't count entries that don't match it (which will have a small cost 
btw, but I don't see an easy way out). In which case, any "normal" CQL 
expression will be covered and that will include SASI for this particular part. 
For custom expressions however, we currently unfornately have 
{{RowFilter.CustomExpression#isSatisfiedBy()}} always return {{true}} 
currently, so we'd have to make that method abstract and require index using 
custom expressions to implement it (which is, strictly speaking, a breaking 
change and implies 4.0 at this point; more on that below).


bq. I'd also suggest to only apply it to 3.11 onwards

One thing that hasn't been mentioned is that the fix has impact on upgrades. 
Namely, in a mixed cluster, some replica will start to return invalid results 
and if the coordinator isn't upgraded yet, it won't filter those, which means 
we'll return invalid entries. More precisely, we may return up-to-date data but 
that doesn't match the request at all, and I would argue that it's a more 
serious problem that the one we're actually fixing here (returning slightly 
stale data but that do match the query). Also, during the upgrade window, I 
suspect it's much more likely to happen than this bug is in the first place. So 
I'd argue that during upgrade, the cure is way worst than the disease\[2\].

Anyway, that's a problem to consider and ideally we'd want to avoid it. That 
does mean we should consider starting to filter entries on index queries 
coordinator-side in 3.0/3.11 (even though we never return them), and only do 
the replica-side parts in 4.0, with a fat warning that you need to only upgrade 
to 4.0 from a 3.X version that has the coordinator-side fix.

Worth noting that this doesn't entirely fly for index using custom indexes: 
we'd need to have them implement the {{CustomExpression#isSatistiedBy}} method 
in 3.X in that scheme since we need it for the coordinator-side filtering as 
well, but making that method abstract in 3.X is, as said above, a breaking 
change. So we might want to only make the method abstract in 4.0, but add 
another fat warning that custom index should consider overriding the method in 
3.X so later upgrades are correct (unless the index implementation can't 
implement the new requirement I expressed above, in which case they can change 
nothing and stay about as broken as they are now, which could be a trade-off).

\[1\]: we could have it if we were to keep historical values on updates for a 
while, but that's a _massive_ change and is thus a no-go since we're talking 
about fixing filtering, which is something we decourage in production in the 
first place.
\[2\]: to the best of my knowledge (and we've been aware of this problem for a 
while, so we'd have likely noticed), no user actually reported this problem. It 
certainly doesn't mean that no-one has hit it, and in fact I'm sure some have, 
but it means it's rare enough and his consequence mild enough that no-one 
noticed (or not enough to report it). After all, we only return stale entries 
for a tiny window and I don't think user rely on 2i consistency _that_ 
seriously (gut guess admitedly).



> 2ndary indexes can return stale data
> ------------------------------------
>
>                 Key: CASSANDRA-8272
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8272
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Sylvain Lebresne
>            Assignee: Andrés de la Peña
>             Fix For: 3.0.x
>
>
> When replica return 2ndary index results, it's possible for a single replica 
> to return a stale result and that result will be sent back to the user, 
> potentially failing the CL contract.
> For instance, consider 3 replicas A, B and C, and the following situation:
> {noformat}
> CREATE TABLE test (k int PRIMARY KEY, v text);
> CREATE INDEX ON test(v);
> INSERT INTO test(k, v) VALUES (0, 'foo');
> {noformat}
> with every replica up to date. Now, suppose that the following queries are 
> done at {{QUORUM}}:
> {noformat}
> UPDATE test SET v = 'bar' WHERE k = 0;
> SELECT * FROM test WHERE v = 'foo';
> {noformat}
> then, if A and B acknowledge the insert but C respond to the read before 
> having applied the insert, then the now stale result will be returned (since 
> C will return it and A or B will return nothing).
> A potential solution would be that when we read a tombstone in the index (and 
> provided we make the index inherit the gcGrace of it's parent CF), instead of 
> skipping that tombstone, we'd insert in the result a corresponding range 
> tombstone.  



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Commented] (CASSANDRA-8272) 2ndary indexes can return stale data

Reply via email to