[jira] [Commented] (CASSANDRA-2494) Quorum reads are not consistent

Peter Schuller (JIRA) Sun, 17 Apr 2011 14:02:45 -0700

    [ 
https://issues.apache.org/jira/browse/CASSANDRA-2494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13020867#comment-13020867
 ]


Peter Schuller commented on CASSANDRA-2494:
-------------------------------------------

As far as I can tell, the consistency being asked for was never promised by 
Cassandra and is in fact not expected.

The expected behavior of writes is that they propagate; the difference between 
ONE and QUORUM is just how many replicas are required to receive a write prior 
to a return to the client with a success code. For reads, that means you 
may get lucky at ONE or you may get lucky at QUORUM; the positive guarantee is 
in the case of a *completing* QUORUM write followed by a QUORUM read.
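
To illustrate why a completed QUORUM write followed by a QUORUM read is the 
positive guarantee: any two quorums of the same replica set must share at 
least one replica, while a write acknowledged at ONE need not overlap with a 
later QUORUM read. A minimal sketch in Python; the function names are 
hypothetical helpers for illustration, not Cassandra APIs:

```python
from itertools import combinations


def quorum(rf: int) -> int:
    """Size of a quorum for replication factor rf (a strict majority)."""
    return rf // 2 + 1


def always_overlap(rf: int, write_acks: int, read_replies: int) -> bool:
    """True if every possible set of `write_acks` replicas shares at least
    one member with every possible set of `read_replies` replicas."""
    replicas = range(rf)
    return all(
        set(w) & set(r)
        for w in combinations(replicas, write_acks)
        for r in combinations(replicas, read_replies)
    )


rf = 3
# QUORUM write + QUORUM read: some replica in the read set saw the write.
quorum_quorum = always_overlap(rf, quorum(rf), quorum(rf))
# ONE write + QUORUM read: the read may miss the single replica that has it.
one_quorum = always_overlap(rf, 1, quorum(rf))
```

With a replication factor of 3, the first case always overlaps and the second 
does not, which is the "may get lucky" situation described above.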

So just to be clear, although I don't think this is what is being asked for: As 
far as I know, it has never been the case, nor the intent to promise, that a 
write which fails is guaranteed not to eventually complete. Simply "fixing" 
reads is not enough; by design the data will be replicated during read-repair 
and AES - this is how consistency is achieved in Cassandra.

However, it sounds like what is being asked for is not that they don't 
propagate in the event of a write "failure", but just that reads don't see the 
writes until they are sufficiently propagated to guarantee that any future 
QUORUM read will also see the data. I can understand that is desirable, in the 
sense of achieving monotonically forward-moving data as the benchmark/test from 
the e-mail thread does. Another way to look at it is that maybe you never want to 
read data successfully prior to achieving a certain level of replication, in 
order to avoid a client ever seeing data that may suddenly go away due to e.g. 
a node failure in spite of said failure not exceeding the number of failures 
the cluster was designed to survive.
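
As a rough illustration of the non-monotonic behavior in question, here is a 
sketch in Python of two successive QUORUM reads after a write that reached 
only one of three replicas. The in-memory `replicas` dict and the helper 
names are assumptions made for the example, and it deliberately assumes that 
no read repair has completed between the two reads:

```python
def quorum_read(replicas, contacted):
    """Return the newest (timestamp, value) pair among the contacted
    replicas. No repair is performed in this simplified model."""
    return max((replicas[name] for name in contacted), key=lambda v: v[0])


def is_monotonic(values):
    """True if each value is >= the one read before it."""
    return all(a <= b for a, b in zip(values, values[1:]))


# Hypothetical state: the write of value 1 (timestamp 1) reached only X;
# Y and Z still hold the older value 0 (timestamp 0).
replicas = {"X": (1, 1), "Y": (0, 0), "Z": (0, 0)}

# The first read happens to contact X and Y, the second Y and Z.
reads = [quorum_read(replicas, q)[1] for q in (("X", "Y"), ("Y", "Z"))]
went_backwards = not is_monotonic(reads)
```

In this model the first read returns the new value and the second returns 
the old one, so the client observes the data moving backwards.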

So the key point would be the bit about guaranteeing that any "future QUORUM 
read will see the data or data subsequently overwritten", and actively 
read-repairing and waiting for it to happen would take care of that. The 
important part is ensuring that a quorum of nodes have seen the data; one 
should not wait for a quorum to agree on the *current* version of the data, 
as that would create potentially unbounded round-trips on hotly contended 
data.
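
A minimal sketch of that idea in Python, assuming a toy in-memory `Replica` 
class with last-write-wins timestamps (both the class and the function name 
are hypothetical, not Cassandra code). The read picks the newest version it 
saw, pushes it to stale replicas until a quorum holds at least that version, 
and then returns. It does a single repair round and does not loop until the 
replicas agree on the current version:

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """Toy stand-in for a node: key -> (timestamp, value)."""
    data: dict = field(default_factory=dict)

    def read(self, key):
        return self.data.get(key, (0, None))

    def write(self, key, ts, value):
        # Last-write-wins by timestamp; returning True models an ack.
        if ts > self.read(key)[0]:
            self.data[key] = (ts, value)
        return True


def quorum_read_with_blocking_repair(contacted, key, rf):
    """Return the newest value seen, but only after a quorum of replicas
    is known to hold a version at least that new."""
    needed = rf // 2 + 1
    if len(contacted) < needed:
        raise ValueError("not enough replicas contacted for a quorum")
    replies = [(r, r.read(key)) for r in contacted]
    ts, value = max((v for _, v in replies), key=lambda v: v[0])
    have = sum(1 for _, v in replies if v[0] >= ts)
    for replica, seen in replies:
        if have >= needed:
            break
        # Repair a stale replica and wait for its ack before counting it.
        if seen[0] < ts and replica.write(key, ts, value):
            have += 1
    if have < needed:
        raise RuntimeError("could not confirm a quorum holds the data")
    return value


x, y, z = Replica(), Replica(), Replica()
x.write("k", 1, "N")  # the write reached X only
first = quorum_read_with_blocking_repair([x, y], "k", rf=3)
second = quorum_read_with_blocking_repair([y, z], "k", rf=3)
```

Because the first read waits for Y to acknowledge the repair write, the 
second read, which does not contact X at all, still sees the same value.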

Thing to consider: One might think about cases where read-repair is currently 
not done, like range slices, and how an implementation that requires read 
repair for consistency affects that.



> Quorum reads are not consistent
> -------------------------------
>
>                 Key: CASSANDRA-2494
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2494
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Sean Bridges
>
> As discussed in this thread,
> http://www.mail-archive.com/[email protected]/msg12421.html
> Quorum reads should be consistent.  Assume we have a cluster of 3 nodes 
> (X,Y,Z) and a replication factor of 3. If a write of N is committed to X, but 
> not Y and Z, then a read from X should not return N unless the read is 
> committed to at  least two nodes.  To ensure this, a read from X should wait 
> for an ack of the read repair write from either Y or Z before returning.
> Are there system tests for cassandra?  If so, there should be a test similar 
> to the original post in the email thread.  One thread should write 1,2,3... 
> at consistency level ONE.  Another thread should read at consistency level 
> QUORUM from a random host, and verify that each read is >= the last read.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
