[
https://issues.apache.org/jira/browse/CASSANDRA-8933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14371791#comment-14371791
]
Sylvain Lebresne commented on CASSANDRA-8933:
---------------------------------------------
Turns out this example is not a problem out of, well, sheer luck. In
{{SliceQueryFilter.collectReducedColumns}}, the stop condition is
{{columnCounter.live() > count}}, which in practice means that even when a
replica has found as many results as requested, it will still include every
tombstone until it finds a new live result. The only reason it was done this
way is that for CQL we have to group cells into rows for counting, and that
condition is an easy way to make sure we don't stop before having included
every cell of our last row. The fact that it keeps including tombstones
afterwards is to some extent an inefficiency we have been too lazy to fix, but
it turns out that it avoids the problem I've described above.
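For reference, here is a minimal sketch of that collection loop (simplified and
illustrative only, not the actual {{SliceQueryFilter}} code; the {{Cell}} type
and class name below are stand-ins):
{code:java}
import java.util.ArrayList;
import java.util.List;

// Stand-in for a reconciled cell: either a live value or a tombstone.
interface Cell { boolean isLive(); }

final class SliceCollectorSketch
{
    static List<Cell> collectReducedColumns(Iterable<Cell> reducedCells, int count)
    {
        List<Cell> collected = new ArrayList<>();
        int live = 0;
        for (Cell cell : reducedCells)
        {
            if (cell.isLive())
                live++;

            // Stop only once strictly more live cells than requested have been
            // seen. Consequence: tombstones that follow the count-th live cell
            // keep being collected until the next live cell shows up.
            if (live > count)
                break;

            collected.add(cell);
        }
        return collected;
    }
}
{code}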
I'll take a few minutes to think about whether there are other cases that
could still be problematic, but it's actually possible that it's not a problem
(it probably was one before 2.0).
> Short reads can return deleted results
> --------------------------------------
>
> Key: CASSANDRA-8933
> URL: https://issues.apache.org/jira/browse/CASSANDRA-8933
> Project: Cassandra
> Issue Type: Bug
> Reporter: Sylvain Lebresne
>
> The current code for short read protection does not handle all cases.
> Currently, we retry only if a node has returned the requested number of
> results but we have fewer results than that post-reconciliation, because this
> means the node in question may have more results it hadn't sent due to the
> limit.
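> For reference, a rough sketch of that existing trigger (illustrative only, not
> the actual coordinator/resolver code; names and counters are made up):
> {code:java}
> // Hypothetical condition for the current short read retry: retry only when a
> // replica filled the limit but the merged result is smaller than requested.
> static boolean currentShortReadRetry(int requested, int mergedLiveCount, int replicaReturnedCount)
> {
>     return replicaReturnedCount >= requested && mergedLiveCount < requested;
> }
> {code}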
> Consider however 3 nodes A, B, C (RF=3), and the following sequence of
> operations (all done at QUORUM):
> # we write 1 and 2 in a partition: all nodes get it.
> # we delete 1: only A and C get it.
> # we delete 2: only B and C get it.
> # we read the first row in the partition (so with a LIMIT 1) and A and B
> answer first.
> At the last step, A will return the tombstone for 1 and the value 2, while B
> will return just 1. So post-reconciliation, we'll return 2 (since A returned
> it and we have no tombstone for it), while we should return nothing. This is
> a short read situation: B stopped at 1 because it was asked for only 1 result,
> but that result didn't make it into the final result and we need further
> results from it. However, because 1 result is requested and we have 1 result
> post-reconciliation, the short read retry won't kick in.
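> To make the reconciliation above concrete, here is a self-contained
> illustration of the merge (an illustrative model only, not Cassandra's actual
> resolver; names, timestamps and types are made up):
> {code:java}
> import java.util.*;
>
> final class ReconcileExample
> {
>     // (name, timestamp, tombstone flag); the higher timestamp wins per name.
>     record Cell(String name, long timestamp, boolean tombstone) {}
>
>     static List<String> reconcile(List<List<Cell>> replicaResponses)
>     {
>         Map<String, Cell> merged = new TreeMap<>();
>         for (List<Cell> response : replicaResponses)
>             for (Cell c : response)
>                 merged.merge(c.name(), c, (a, b) -> a.timestamp() >= b.timestamp() ? a : b);
>
>         List<String> live = new ArrayList<>();
>         for (Cell c : merged.values())
>             if (!c.tombstone())
>                 live.add(c.name());
>         return live;
>     }
>
>     public static void main(String[] args)
>     {
>         // A returned: the tombstone for 1 (newer) and the live value 2.
>         List<Cell> fromA = List.of(new Cell("1", 2, true), new Cell("2", 1, false));
>         // B returned: just the live value 1, because of the LIMIT 1.
>         List<Cell> fromB = List.of(new Cell("1", 1, false));
>
>         // Prints [2]: 2 is returned even though it was deleted, because B
>         // never got to send its tombstone for 2.
>         System.out.println(reconcile(List.of(fromA, fromB)));
>     }
> }
> {code}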
> In practice, the short read check should be generalized: if any node X
> returns the requested number of results but any of those results gets skipped
> post-reconciliation, we might have a short read. Basically, enforcing the
> limit replica-side is optimistic and assumes that all results of that replica
> will be used, and as soon as that assumption fails we should get back more
> results.
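> A sketch of that generalized check (illustrative only; the per-replica counts
> would have to come from the recounting discussed below):
> {code:java}
> // Hypothetical generalized condition: retry replica X if it hit the limit and
> // at least one of the results it sent was discarded during reconciliation.
> static boolean generalizedShortReadRetry(int requested, int replicaReturnedCount, int replicaResultsSurviving)
> {
>     return replicaReturnedCount >= requested && replicaResultsSurviving < replicaReturnedCount;
> }
> {code}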
> Implementing that generalized condition can probably be done in
> RowDataResolver.scheduleRepairs by using the computed repairs to know whether
> a node has had some of its results skipped by reconciliation, but we want to
> know whether a full CQL row has been skipped or not, so this will probably
> force us to add some recounting.
> I'll note that I've fixed this problem on my branch for CASSANDRA-8099 (where
> this is both simpler and somewhat more efficient since short reads don't
> retry full queries there), so if we decide this is too risky to fix in 2.1, we
> can possibly just mark this as a duplicate of CASSANDRA-8099.
> Lastly, it shouldn't be too hard to extend our current short read dtests to
> test for that case, but I haven't taken the time to do so yet
> ([~philipthompson] do you think you can have a look at adding such a test at
> some point?).
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)