[
https://issues.apache.org/jira/browse/CASSANDRA-8589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14394530#comment-14394530
]
Sylvain Lebresne commented on CASSANDRA-8589:
---------------------------------------------
It would actually be nice to start by ensuring we can reproduce it through a
dtest. It shouldn't be too hard to write one, and there's no point in chasing a
complex solution if, as with CASSANDRA-8933, something I forgot about in the
code makes this a non-problem. Also, CASSANDRA-8099 should actually solve this,
so if that's confirmed by said reproduction dtest, maybe we're good with fixing
it in 3.0 only.
> Reconciliation in presence of tombstone might yield stale data
> --------------------------------------------------------------
>
> Key: CASSANDRA-8589
> URL: https://issues.apache.org/jira/browse/CASSANDRA-8589
> Project: Cassandra
> Issue Type: Bug
> Reporter: Sylvain Lebresne
>
> Consider 3 replicas A, B, C (so RF=3) and consider that we do the following
> sequence of actions at {{QUORUM}}, where I indicate the replicas acknowledging
> each operation (and let's assume that a replica that doesn't ack is a replica
> that doesn't get the update):
> {noformat}
> CREATE TABLE test (k text, t int, v int, PRIMARY KEY (k, t))
> INSERT INTO test(k, t, v) VALUES ('k', 0, 0); // acked by A, B and C
> INSERT INTO test(k, t, v) VALUES ('k', 1, 1); // acked by A, B and C
> INSERT INTO test(k, t, v) VALUES ('k', 2, 2); // acked by A, B and C
> DELETE FROM test WHERE k='k' AND t=1; // acked by A and C
> UPDATE test SET v = 3 WHERE k='k' AND t=2; // acked by B and C
> SELECT * FROM test WHERE k='k' LIMIT 2; // answered by A and B
> {noformat}
> Every operation has achieved quorum, but on the last read, A will respond
> {{0->0, tombstone 1, 2->2}} and B will respond {{0->0, 1->1}} (B reaches the
> limit before getting to {{2->3}}). As a consequence we'll answer
> {{0->0, 2->2}}, which is incorrect (we should respond {{0->0, 2->3}}).
> Put another way, if we have a limit, every replica honors that limit, but
> since tombstones can "suppress" results from other nodes, we may have some
> cells for which we actually don't get a quorum of responses (even though we
> globally have a quorum of replica responses).
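
To make the scenario concrete, here is a minimal toy model in Python. It is only a sketch, not Cassandra code: the dict-based replica state, the made-up timestamps and the helper names replica_read and reconcile are all assumptions for illustration. It shows the limited read returning 2->2 while an unlimited read of the same two replicas returns 2->3.

```python
# Toy model (assumption, not Cassandra internals): each replica maps the
# clustering key t -> (timestamp, value); a value of None is a tombstone.
# Timestamps 1..5 are invented to follow the order of the statements.
TOMBSTONE = None


def replica_read(data, limit):
    """Return cells in clustering order until `limit` live rows are collected."""
    out, live = {}, 0
    for t in sorted(data):
        if live == limit:
            break
        out[t] = data[t]
        if data[t][1] is not TOMBSTONE:
            live += 1
    return out


def reconcile(responses, limit):
    """Keep the newest cell per key, drop tombstoned rows, apply the limit."""
    merged = {}
    for resp in responses:
        for t, cell in resp.items():
            if t not in merged or cell[0] > merged[t][0]:
                merged[t] = cell
    live = [(t, c[1]) for t, c in sorted(merged.items()) if c[1] is not TOMBSTONE]
    return live[:limit]


# State after the writes: A missed the UPDATE, B missed the DELETE.
A = {0: (1, 0), 1: (4, TOMBSTONE), 2: (3, 2)}
B = {0: (1, 0), 1: (2, 1), 2: (5, 3)}

limited = reconcile([replica_read(A, 2), replica_read(B, 2)], 2)
full = reconcile([replica_read(A, 10), replica_read(B, 10)], 2)
print(limited)  # [(0, 0), (2, 2)]  <- stale value for t=2
print(full)     # [(0, 0), (2, 3)]  <- what the client should see
```

In this model the only difference between the two reads is that B stops after two live rows, so its newer 2->3 never takes part in reconciliation.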
> In practice, this probably occurs rather rarely, and so the "simpler" fix is
> probably to do something similar to the "short reads protection": detect when
> this could have happened (based on how replica responses are reconciled) and
> do an additional request in that case. That detection will have potential
> false positives, but I suspect we can be precise enough that those false
> positives will be very rare (we should nonetheless track how often this code
> gets triggered, and if we see that it's more often than we think, we could
> pro-actively bump user limits internally to reduce those occurrences).
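
One possible shape for that detection, as a hedged Python sketch. The rule used here is an assumption, not something specified in this ticket: a replica is flagged when it returned a full `limit` of live rows (so it may have stopped early) and at least one of those rows was shadowed by a newer tombstone from another replica. The name needs_retry and the dict-based response format are hypothetical.

```python
# Responses map clustering key t -> (timestamp, value); None is a tombstone.
# This mirrors a toy model, not Cassandra's actual response format.
TOMBSTONE = None


def is_shadowed(t, cell, other):
    """True if `other` holds a tombstone for t that is newer than `cell`."""
    if t not in other:
        return False
    ts, value = other[t]
    return value is TOMBSTONE and ts > cell[0]


def needs_retry(responses, limit):
    """Return indexes of replicas that should be asked for more rows.

    Assumed rule: the replica filled its limit with live rows and lost at
    least one of them to another replica's tombstone during reconciliation.
    """
    suspects = []
    for i, resp in enumerate(responses):
        live = {t: c for t, c in resp.items() if c[1] is not TOMBSTONE}
        if len(live) < limit:
            continue  # replica ran out of rows before the limit; nothing cut off
        others = [r for j, r in enumerate(responses) if j != i]
        if any(is_shadowed(t, c, o) for t, c in live.items() for o in others):
            suspects.append(i)
    return suspects


# Limited responses from the example: A (index 0) and B (index 1), LIMIT 2.
resp_a = {0: (1, 0), 1: (4, TOMBSTONE), 2: (3, 2)}
resp_b = {0: (1, 0), 1: (2, 1)}
print(needs_retry([resp_a, resp_b], 2))  # [1] -> re-query B for more rows
```

This rule can raise a false positive when the flagged replica has no further rows past its limit, which is the kind of occurrence the tracking mentioned above would count.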
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)