read quorum doesn't mean we read newest values from a quorum number of
replicas but to ensure we read at least one newest value as long as write
quorum succeeded beforehand and W+R > N.
This is correct.
It's not that a quorum of nodes agree, it's that a quorum of nodes participate.
If a quorum participates in both the write and the read, you are guaranteed
that one node was involved in both. The Wikipedia definition helps here: "A
quorum is the minimum number of members of a deliberative assembly necessary
to conduct the business of that group" http://en.wikipedia.org/wiki/Quorum
It's a two step process: first, do we have enough people to make a decision?
Second, following the rules, what was the decision?
In C* the rule is to use the value with the highest timestamp, not the value
with the highest number of votes. The red boxes on this slide are the
winning values:
http://www.slideshare.net/aaronmorton/cassandra-does-what-code-mania-2012/67
(thinking one of my slides in that deck may have been misleading in the past).
In Riak the rule is to use Vector Clocks.
So:
"I agree that returning val4 is the right thing to do if quorum (two) nodes
among (node1,node2,node3) have the val4"
is incorrect.
We return the value with the highest timestamp returned from the nodes
involved in the read. Only one of them needs to have val4.
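The resolution rule above can be sketched in a few lines (this is an assumed illustration of the rule, not actual Cassandra code):

```python
# Among the replica responses, the value with the highest timestamp wins,
# regardless of how many replicas returned it.

def resolve(responses):
    """responses: list of (value, timestamp) pairs from the replicas read."""
    return max(responses, key=lambda vt: vt[1])

# Two replicas answer a QUORUM read; only one has the newest value.
responses = [("val3", 1), ("val4", 2)]
print(resolve(responses))   # ('val4', 2) -- one copy is enough
```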
The heart of the problem
here is that the coordinator responds to a client request assuming that
consistency has been achieved the moment it issues a row repair with the
super-set of the resolved value, without receiving acknowledgement of the
success of that repair from the replicas for a given consistency constraint.
and
My intuition behind saying this is that we
would respond to the client without the replicas having confirmed that they
meet the consistency requirement.
It is not necessary for the coordinator to wait.
Consider an example: the app has stopped writing to the cluster. For a certain
column, nodes 1, 2 and 3 have value:timestamp bar:2, bar:2 and foo:1
respectively. The last write was a successful CL QUORUM write of bar with
timestamp 2. However, node 3 did not acknowledge this write for some reason.
To make it interesting, the commit log volume on node 3 is full. Mutations are
blocking in the commit log queue, so any write on node 3 will time out and
fail, but reads are still working. We could imagine this is why node 3 did not
commit bar:2.
Some read examples, RR is not active:
1) Client reads from node 4 (a non replica) with CL QUORUM, request goes to
nodes 1 and 2. Both agree on bar as the value.
2) Client reads from node 3 with CL QUORUM, request is processed locally and on
node 2.
* There is a digest mismatch
* A Row Repair read runs against nodes 2 and 3.
* The super set resolves to bar:2
* Node 3 (the coordinator) queues a delta write locally to write bar:2.
No other delta writes are sent.
* Node 3 returns bar:2 to the client
3) Client reads from node 3 at CL QUORUM. The same thing as (2) happens and
bar:2 is returned.
4) Client reads from node 2 at CL QUORUM, read goes to nodes 2 and 3. Roughly the
same thing as (2) happens and bar:2 is returned.
5) Client reads from node 1 at CL ONE. Read happens locally only and returns
bar:2.
6) Client reads from node 3 at CL ONE. Read happens locally only and returns
foo:1.
So:
* A read at CL QUORUM will always return bar:2, even if node 3 only has foo:1
on disk.
* A read at CL ONE may return the last write or any previous write.
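The whole scenario can be simulated with the timestamp rule alone (a sketch under the assumptions above, not Cassandra internals):

```python
# Nodes 1 and 2 hold bar:2, node 3 holds the stale foo:1.
from itertools import combinations

replicas = {"node1": ("bar", 2), "node2": ("bar", 2), "node3": ("foo", 1)}

def read(contacted):
    """Resolve a read across the contacted replicas by highest timestamp."""
    return max((replicas[n] for n in contacted), key=lambda vt: vt[1])

# Every QUORUM (two-node) combination returns bar:2, because any pair
# must include node 1 or node 2 ...
for pair in combinations(replicas, 2):
    assert read(pair) == ("bar", 2)

# ... but a CL ONE read served only by node 3 returns the stale value.
print(read(["node3"]))   # ('foo', 1)
```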
The delta write from the Row Repair goes to a single node, so R + W > N cannot
be applied to it. It can almost be thought of as an internal implementation
detail. The delta write from a Digest Mismatch, HH writes, full RR writes and
nodetool repair are used to:
* Reduce the chance of a Digest Mismatch when CL ONE
* Eventually reach a state where reads at any CL return the last write.
They are not used to ensure strong consistency when R + W > N. You could turn
those things off and R + W > N would still work.
Hope that helps.
-
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com
On 26/10/2012, at 7:15 AM, shankarpnsn shankarp...@gmail.com wrote:
manuzhang wrote
read quorum doesn't mean we read newest values from a quorum number of
replicas but to ensure we read at least one newest value as long as write
quorum succeeded beforehand and W+R > N.
I beg to differ here. Any read/write, by definition of quorum, should have
at least n/2 + 1 replicas that agree on that read/write value. Responding to
the user with a newer value, even if the write creating the new value hasn't
completed, cannot guarantee any read consistency.
Hiller, Dean wrote
Kind of an interesting question
I think you are saying if a client read resolved only the two nodes as
said in Aaron's email back to the client and read-repair was kicked off
because of the inconsistent values and the write did not complete yet and
I guess you would have two nodes go down to lose the value right after
the