[
https://issues.apache.org/jira/browse/CASSANDRA-8272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16006428#comment-16006428
]
Sylvain Lebresne commented on CASSANDRA-8272:
---------------------------------------------
bq. 1) Stick with the current approach! It's good and I do not think using
special tombstones would buy us anything.
It would solve the {{LIMIT}} issue for instance. It's also theoretically a bit
more efficient as won't ship full rows to the coordinator that it end up
discarding.
Not that I'm suggesting we use special tombstones: thinking about it more I
think it's actually broken in some cases where we delete/re-insert an entry in
rapid succession. Basically, the deletion could end up deleting a valid entry
post-re-insert if one of the node haven't seen that re-insert yet.
Besides, the {{LIMIT}} issue isn't that hard to fix and the performance impact
is unlikely big in most cases. But my main point is that on principle we should
be careful to look at the whole solution before comparing it to alternatives
and deciding which one we "stick with". I've seen simple solutions get pretty
messy once you fix all edge cases to the point that it wasn't the best solution
anymore.
bq. 3) Generalize the approach so we can fix filtering
I wouldn't make that a requirement to commit this ticket. Filtering is
genuinely orthogonal to indexing: it's as valid to use with and without
indexing, and the code for both is mainly orthogonal. It's in particular not
true that fixing this bug will be "invalidated when filtering is applied",
especially when 2i is used but filtering isn't (hopefully the most common case
in production since filtering is what it is).
Don't get wrong, both problems are certainly related, in that the underlying
problem is that replica can return stale data where up-to-date replica don't
return anything to indicate this is stale.
But in the indexing cases, we have more information in that we have the index
tombstones to know what entries might be stale on another node, so we know *a
relatively minimal* set of info (rows) to return to "fix" potentially stale
entries from other replica.
In the filtering case, we don't have similar information to help\[1\]: the
equivalent to the solution from [~adelapena]'s patch is to say that when a
replica sees a row they should filter out, they still return it in case it may
be needed to "fix" a stale replica. But doing so exactly amounts to not doing
any filtering replica-side and moving it all server-side, which is what I'm
suggesting in CASSANDRA-8273.
In other words, this ticket has a mostly-replica-side based solution but the
filtering one probably doesn't. That's enough difference imo to not wed
ourselves to fixing both problems in the same ticket, and to keep discussion
around filtering on CASSANDRA-8273 (doesn't mean we can't cross-reference of
course when appropriate).
bq. and any other indexing implementation (most notably SASI)
That I agree is something we should consider. Though tbh, I have doubts we can
have a solution that is completely index agnostic. What we need is that indexes
return any entries that _was_ recently valid but isn't anymore (up to gc_grace)
so that the result from any replica having the old entry can be properly
reconciled and skipped. That has to be something the index implementation
itself provides to us, and the best we can do is simply specify that index
implementations have to do that to be correct. Though I suspect not all custom
index implementations will be able to provided that at all in practice.
Don't know about SASI in particular though. I assume it kind of has to keep
tombstones for old entries in the first place (not sure how it handles deletes
otherwise) and if so, we should certainly update it to implement the new
requirement described above.
There is the question of the {{LIMIT}} problem. I think the proper way to fix
that is to create a new {{DataLimits}} that keep the {{RowFilter}} around and
that doesn't count entries that don't match it (which will have a small cost
btw, but I don't see an easy way out). In which case, any "normal" CQL
expression will be covered and that will include SASI for this particular part.
For custom expressions however, we currently unfornately have
{{RowFilter.CustomExpression#isSatisfiedBy()}} always return {{true}}
currently, so we'd have to make that method abstract and require index using
custom expressions to implement it (which is, strictly speaking, a breaking
change and implies 4.0 at this point; more on that below).
bq. I'd also suggest to only apply it to 3.11 onwards
One thing that hasn't been mentioned is that the fix has impact on upgrades.
Namely, in a mixed cluster, some replica will start to return invalid results
and if the coordinator isn't upgraded yet, it won't filter those, which means
we'll return invalid entries. More precisely, we may return up-to-date data but
that doesn't match the request at all, and I would argue that it's a more
serious problem that the one we're actually fixing here (returning slightly
stale data but that do match the query). Also, during the upgrade window, I
suspect it's much more likely to happen than this bug is in the first place. So
I'd argue that during upgrade, the cure is way worst than the disease\[2\].
Anyway, that's a problem to consider and ideally we'd want to avoid it. That
does mean we should consider starting to filter entries on index queries
coordinator-side in 3.0/3.11 (even though we never return them), and only do
the replica-side parts in 4.0, with a fat warning that you need to only upgrade
to 4.0 from a 3.X version that has the coordinator-side fix.
Worth noting that this doesn't entirely fly for index using custom indexes:
we'd need to have them implement the {{CustomExpression#isSatistiedBy}} method
in 3.X in that scheme since we need it for the coordinator-side filtering as
well, but making that method abstract in 3.X is, as said above, a breaking
change. So we might want to only make the method abstract in 4.0, but add
another fat warning that custom index should consider overriding the method in
3.X so later upgrades are correct (unless the index implementation can't
implement the new requirement I expressed above, in which case they can change
nothing and stay about as broken as they are now, which could be a trade-off).
\[1\]: we could have it if we were to keep historical values on updates for a
while, but that's a _massive_ change and is thus a no-go since we're talking
about fixing filtering, which is something we decourage in production in the
first place.
\[2\]: to the best of my knowledge (and we've been aware of this problem for a
while, so we'd have likely noticed), no user actually reported this problem. It
certainly doesn't mean that no-one has hit it, and in fact I'm sure some have,
but it means it's rare enough and his consequence mild enough that no-one
noticed (or not enough to report it). After all, we only return stale entries
for a tiny window and I don't think user rely on 2i consistency _that_
seriously (gut guess admitedly).
> 2ndary indexes can return stale data
> ------------------------------------
>
> Key: CASSANDRA-8272
> URL: https://issues.apache.org/jira/browse/CASSANDRA-8272
> Project: Cassandra
> Issue Type: Bug
> Reporter: Sylvain Lebresne
> Assignee: Andrés de la Peña
> Fix For: 3.0.x
>
>
> When replica return 2ndary index results, it's possible for a single replica
> to return a stale result and that result will be sent back to the user,
> potentially failing the CL contract.
> For instance, consider 3 replicas A, B and C, and the following situation:
> {noformat}
> CREATE TABLE test (k int PRIMARY KEY, v text);
> CREATE INDEX ON test(v);
> INSERT INTO test(k, v) VALUES (0, 'foo');
> {noformat}
> with every replica up to date. Now, suppose that the following queries are
> done at {{QUORUM}}:
> {noformat}
> UPDATE test SET v = 'bar' WHERE k = 0;
> SELECT * FROM test WHERE v = 'foo';
> {noformat}
> then, if A and B acknowledge the insert but C respond to the read before
> having applied the insert, then the now stale result will be returned (since
> C will return it and A or B will return nothing).
> A potential solution would be that when we read a tombstone in the index (and
> provided we make the index inherit the gcGrace of it's parent CF), instead of
> skipping that tombstone, we'd insert in the result a corresponding range
> tombstone.
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]