[ 
https://issues.apache.org/jira/browse/CASSANDRA-48?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12697908#action_12697908
 ] 

Sandeep Tata commented on CASSANDRA-48:
---------------------------------------

There's a problem with my previous example. Here it is again, slightly improved:

Process1:
Read X 
T&S X +1 , timestamp=5

Process2:
Read X 
T&S X +2, timestamp=4

The two T&S are routed to A , B, C: 

@A T&S X + 1 
                            @B T&S X+2 
                            @C T&S X+2 
                            @A T&S X+2 <------Fail, A already updated X 
@B T&S X+1 <---- fail 
@C T&S X+2 <---- fail 

The node that coordinated the T&S for Process1 reports a failure because 2 out 
of 3 T&S attempts failed.
Process2 reports success because 1 out of 3 failed, 2 succeeded.
The correct value at the node is now X+2 (from Process2)

Now, if node C dies, here's what we have on A & B:

@A: X+1, ts=5
@B: X+2, ts=4

We can't resolve this using majority (we don't do that currently anyway). The 
latest timestamp rule will give us the wrong answer, and we'll lose the update 
from Process2. Bad.

If the value at A was fixed before C went down, we wouldn't be in this trouble:

The node coordinating the T&S should always "cancel" the successful (minority) 
writes from a failed attempt. A simple way to implement cancelled writes would 
be to add a bit for that column indicating that the column is "cancelled". The 
cancelling write should also use T&S to avoid trampling on a newer value if the 
old invalid value was replaced in the meantime. If the cancelling T&S fails, we 
have nothing to worry about. When a node sees a cancelled column, it should 
replace that value with the value from a peer even if the peer node sends a 
value with an older timestamp. This would involve 1) a simple change to read 
repair logic, 2) change to the read path to make sure cancelled columns are not 
read (if a read results in a cancelled column, then you trigger quorum read), 
and 3) modifying ColumnFamily to understand cancelled writes.

The cancel-writes approach is cleaner than a forced rollback to an older value 
by removing this bad value from the memtable. At first glance it seems simpler, 
but what if the memtable got flushed in the meantime? You can't really remove 
this bad value from the SSTable.

What happens with Hinted-handoff:
If you don't hear back from any of the intended peers, the T&S is assumed 
failed. Handing off T&S doesn't make sense because it requires a success 
quorum. The failed node can come back up and get the correct values using read 
repair -- business as usual.

Of course, this still leaves us with a window of vulnerability. If C goes down 
(in the example above) *and* the node running Process1 dies before it has a 
chance to fix A, we still end up with an invalid update. But I'm guessing this 
is a good tradeoff to make in a system that relies on eventual consistency. 
This becomes a problem if C happens to be the node running Process1. The 
alternative would be a protocol based on tentative writes and confirmations -- 
that brings in a whole slew of other new semantics and other complications.

I'm going to scan the literature to see if there are alternative protocols in 
this consistency class  -- any thoughts on this approach? Any alternatives?

> test-and-set
> ------------
>
>                 Key: CASSANDRA-48
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-48
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Jonathan Ellis
>
> Atomic test-and-set insert operation would be nice: "set value to X but only 
> if the current value is still Y."  This allows a sort of optimistic 
> consistency: perform a GET, then perform test-and-set with the value of that 
> GET as Y.
> I do not think that this requires strong consistency to be useful.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Reply via email to