Sylvain Lebresne created CASSANDRA-8589:
-------------------------------------------

             Summary: Reconciliation in presence of tombstone might yield stale 
data
                 Key: CASSANDRA-8589
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8589
             Project: Cassandra
          Issue Type: Bug
            Reporter: Sylvain Lebresne


Consider 3 replicas A, B, C (so RF=3) and consider that we do the following 
sequence of operations at {{QUORUM}}, where I indicate the replicas acknowledging 
each operation (and let's assume that a replica that doesn't ack is a replica 
that doesn't get the update):
{noformat}
CREATE TABLE test (k text, t int, v int, PRIMARY KEY (k, t))

INSERT INTO test(k, t, v) VALUES ('k', 0, 0); // acked by A, B and C
INSERT INTO test(k, t, v) VALUES ('k', 1, 1); // acked by A, B and C
INSERT INTO test(k, t, v) VALUES ('k', 2, 2); // acked by A, B and C

DELETE FROM test WHERE k='k' AND t=1;         // acked by A and C

UPDATE test SET v = 3 WHERE k='k' AND t=2;    // acked by B and C

SELECT * FROM test WHERE k='k' LIMIT 2;       // answered by A and B
{noformat}
Every operation has achieved quorum, but on the last read, A will respond 
{{0->0, tombstone 1, 2->2}} and B will respond {{0->0, 1->1}} (B reaches the 
limit of 2 live rows before getting to {{t=2}}, so it never sends its newer 
{{2->3}}). As a consequence we'll answer {{0->0, 2->2}}, which is incorrect (we 
should respond {{0->0, 2->3}}).
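
To make the scenario concrete, here is a minimal simulation of the read above. 
It is only a sketch and not Cassandra's actual read path: the names 
{{replica_read}} and {{reconcile}}, the integer timestamps and the 
last-write-wins merge are assumptions made for illustration.

```python
# Toy model of the scenario above; NOT Cassandra's actual read path.
# Names, integer timestamps and the merge rule are illustrative assumptions.
TOMBSTONE = object()

def replica_read(rows, limit):
    """Return (t, value, ts) in clustering order until `limit` live rows
    have been included. Tombstones are returned but do not count."""
    out, live = [], 0
    for t in sorted(rows):
        if live == limit:
            break
        value, ts = rows[t]
        out.append((t, value, ts))
        if value is not TOMBSTONE:
            live += 1
    return out

def reconcile(responses, limit):
    """Keep the newest write per clustering key, drop deleted rows,
    then apply the user limit."""
    merged = {}
    for resp in responses:
        for t, value, ts in resp:
            if t not in merged or ts > merged[t][1]:
                merged[t] = (value, ts)
    live = [(t, v) for t, (v, _) in sorted(merged.items())
            if v is not TOMBSTONE]
    return live[:limit]

# Replica state after the writes above (timestamps 1..5 in issue order).
A = {0: (0, 1), 1: (TOMBSTONE, 4), 2: (2, 3)}   # missed the UPDATE
B = {0: (0, 1), 1: (1, 2), 2: (3, 5)}           # missed the DELETE
C = {0: (0, 1), 1: (TOMBSTONE, 4), 2: (3, 5)}   # saw everything

stale = reconcile([replica_read(A, 2), replica_read(B, 2)], 2)
fresh = reconcile([replica_read(A, 2), replica_read(C, 2)], 2)
print(stale)  # [(0, 0), (2, 2)]
print(fresh)  # [(0, 0), (2, 3)]
```

In this model the A+B read returns the stale {{2->2}}, while the same read 
answered by A and C returns {{2->3}}.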

Put another way, if we have a limit, every replica honors that limit, but since 
tombstones can "suppress" results from other nodes, we may have some cells for 
which we don't actually get a quorum of responses (even though we globally have 
a quorum of replica responses).

In practice, this probably occurs rather rarely, and so the "simpler" fix is 
probably to do something similar to the "short reads protection": detect when 
this could have happened (based on how replica responses are reconciled) and do 
an additional request in that case. That detection will have potential false 
positives, but I suspect we can be precise enough that those false positives 
will be very rare (we should nonetheless track how often this code gets 
triggered, and if we see that it's more often than we think, we could 
pro-actively bump user limits internally to reduce those occurrences).
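
One possible shape for that detection, as a rough sketch only: the heuristic 
below is an assumption and not a proposed patch. It flags a read when the 
reconciled result contains a row that lies beyond the last row returned by a 
replica that stopped because it reached the limit, since that replica never had 
a chance to vouch for that row. The name {{needs_retry}} and the 
{{(t, value, timestamp)}} response format are hypothetical.

```python
# Hypothetical detection heuristic; names and data model are assumptions.
TOMBSTONE = object()

def needs_retry(responses, limit):
    """Return True if some row of the reconciled result lies past the
    point where a limit-truncated replica stopped reading."""
    # Newest write per clustering key across all responses.
    merged = {}
    for resp in responses:
        for t, value, ts in resp:
            if t not in merged or ts > merged[t][1]:
                merged[t] = (value, ts)
    result_keys = [t for t in sorted(merged)
                   if merged[t][0] is not TOMBSTONE][:limit]

    for resp in responses:
        live = [t for t, v, _ in resp if v is not TOMBSTONE]
        if len(live) < limit:
            continue  # replica exhausted its data; it was not cut off
        last_seen = max(t for t, _, _ in resp)
        if any(t > last_seen for t in result_keys):
            return True  # this replica said nothing about that row
    return False

# Responses to the SELECT ... LIMIT 2 from the example above.
resp_a = [(0, 0, 1), (1, TOMBSTONE, 4), (2, 2, 3)]
resp_b = [(0, 0, 1), (1, 1, 2)]
resp_c = [(0, 0, 1), (1, TOMBSTONE, 4), (2, 3, 5)]

print(needs_retry([resp_a, resp_b], 2))  # True
print(needs_retry([resp_a, resp_c], 2))  # False
```

This check can report false positives (the truncated replica may hold nothing 
newer for the rows it skipped), which is why counting how often it fires would 
matter.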




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
