Sylvain Lebresne created CASSANDRA-8589:
-------------------------------------------
Summary: Reconciliation in presence of tombstone might yield stale
data
Key: CASSANDRA-8589
URL: https://issues.apache.org/jira/browse/CASSANDRA-8589
Project: Cassandra
Issue Type: Bug
Reporter: Sylvain Lebresne
Consider 3 replicas A, B, C (so RF=3) and consider that we do the following
sequence of actions at {{QUORUM}}, where I indicate the replicas acknowledging
each operation (and let's assume that a replica that doesn't ack is a replica
that doesn't get the update):
{noformat}
CREATE TABLE test (k text, t int, v int, PRIMARY KEY (k, t))
INSERT INTO test(k, t, v) VALUES ('k', 0, 0); // acked by A, B and C
INSERT INTO test(k, t, v) VALUES ('k', 1, 1); // acked by A, B and C
INSERT INTO test(k, t, v) VALUES ('k', 2, 2); // acked by A, B and C
DELETE FROM test WHERE k='k' AND t=1; // acked by A and C
UPDATE test SET v = 3 WHERE k='k' AND t=2; // acked by B and C
SELECT * FROM test WHERE k='k' LIMIT 2; // answered by A and B
{noformat}
Every operation has achieved quorum, but on the last read, A will respond
{{0->0, tombstone 1, 2->2}} and B will respond {{0->0, 1->1}} (B has not seen
the delete, so its two live rows exhaust the limit before it reaches t=2). As a
consequence we'll answer {{0->0, 2->2}}, which is incorrect (we should respond
{{0->0, 2->3}}).
Put another way, if we have a limit, every replica honors that limit, but since
tombstones can "suppress" results from other nodes, we may have some cells for
which we don't actually get a quorum of responses (even though we globally have
a quorum of replica responses).
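The scenario above can be reproduced with a small toy model. This is only a sketch, not Cassandra's actual read path: the function names ({{replica_read}}, {{reconcile}}) are hypothetical, write timestamps are assumed to be 1 through 5 in statement order, and each replica is assumed to return tombstones without counting them toward the limit.

```python
# Toy model of the read in this ticket; NOT Cassandra's real read path.
# Cells are (value, timestamp); TOMBSTONE marks a deleted row.
TOMBSTONE = object()


def replica_read(rows, limit):
    """Return cells in clustering order until `limit` live rows are seen.

    Tombstones are returned but do not count toward the limit.
    """
    out = {}
    live = 0
    for t in sorted(rows):
        if live == limit:
            break
        out[t] = rows[t]
        if rows[t][0] is not TOMBSTONE:
            live += 1
    return out


def reconcile(responses, limit):
    """Keep the newest cell per clustering key, drop tombstones, apply limit."""
    merged = {}
    for resp in responses:
        for t, cell in resp.items():
            if t not in merged or cell[1] > merged[t][1]:
                merged[t] = cell
    live = [(t, v) for t, (v, _ts) in sorted(merged.items())
            if v is not TOMBSTONE]
    return live[:limit]


# Assumed timestamps: the three INSERTs are 1, 2, 3; DELETE is 4; UPDATE is 5.
replica_a = {0: (0, 1), 1: (TOMBSTONE, 4), 2: (2, 3)}  # missed the UPDATE
replica_b = {0: (0, 1), 1: (1, 2), 2: (3, 5)}          # missed the DELETE

a_resp = replica_read(replica_a, limit=2)  # 0->0, tombstone 1, 2->2
b_resp = replica_read(replica_b, limit=2)  # 0->0, 1->1 (stops before t=2)

result = reconcile([a_resp, b_resp], limit=2)
print(result)  # [(0, 0), (2, 2)] -- stale: t=2 should carry value 3
```

In this model, B's response ends at t=1, so the value for t=2 in the final answer comes from A alone, which is exactly the "no quorum for that cell" situation described above.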
In practice, this probably occurs rather rarely, so the "simpler" fix is
probably to do something similar to the "short reads protection": detect when
this could have happened (based on how replica responses are reconciled) and do
an additional request in that case. That detection will have potential false
positives, but I suspect we can be precise enough that those false positives
will be very rare (we should nonetheless track how often this code gets
triggered, and if we see that it's more often than we think, we could
proactively bump user limits internally to reduce those occurrences).
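One possible shape for such a detection is sketched below. It is an assumption for illustration, not necessarily the rule this ticket would end up with: a replica is flagged when its response was cut off by the limit before reaching the last row of the reconciled result, since it may hold newer data for the rows it never returned. The function name {{uncovered_replicas}} and the input layout are hypothetical.

```python
# Hypothetical detection rule; an illustration, not Cassandra code.
def uncovered_replicas(responses, result_keys, limit):
    """Return replicas that may hold unseen data for rows in the result.

    responses:   dict of replica name -> list of (clustering_key, is_live),
                 in clustering order, as returned by that replica.
    result_keys: clustering keys present in the reconciled result.
    limit:       the user-supplied LIMIT.
    """
    if not result_keys:
        return []
    last_in_result = max(result_keys)
    suspects = []
    for name, cells in responses.items():
        if not cells:
            continue
        live = sum(1 for _key, is_live in cells if is_live)
        stopped_at = cells[-1][0]
        # The replica hit its limit before the end of the final result,
        # so some rows in the result were never checked against it.
        if live >= limit and stopped_at < last_in_result:
            suspects.append(name)
    return suspects


# Responses from the example in this ticket (LIMIT 2).
responses = {
    "A": [(0, True), (1, False), (2, True)],  # 0->0, tombstone 1, 2->2
    "B": [(0, True), (1, True)],              # 0->0, 1->1
}
suspects = uncovered_replicas(responses, result_keys=[0, 2], limit=2)
print(suspects)  # ['B'] -- B would be asked again for rows after t=1
```

This rule also fires when the flagged replica turns out to hold nothing newer for the rows it skipped, which is the kind of false positive mentioned above.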