[jira] [Commented] (CASSANDRA-4989) Expose new SliceQueryFilter features through Thrift interface

2013-03-25 Thread Cristian Opris (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13612553#comment-13612553
 ] 

Cristian Opris commented on CASSANDRA-4989:
---

Yes, Sylvain is correct. This is essentially an optimization to avoid 
iterating through the columns and just get the latest group that has a common 
prefix. I noticed this can be done with the new SliceQueryFilter so it would be 
useful if it can be exposed.
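
To make the requested behaviour concrete, here is a rough sketch (not the actual SliceQueryFilter code) that walks an already-sorted list of composite column names and stops once the requested number of prefix groups has been collected. The function name `slice_grouped` and the tuple representation of composite columns are assumptions made for this example only.

```python
from itertools import groupby

def slice_grouped(columns, prefix_len, group_limit):
    """Return the columns of the first `group_limit` groups, where a group is
    a run of consecutive composite names sharing the first `prefix_len`
    components. `columns` is assumed to be in comparator order already.
    Hypothetical helper, not a Cassandra API."""
    result = []
    groups = groupby(columns, key=lambda name: name[:prefix_len])
    for taken, (_prefix, members) in enumerate(groups):
        if taken >= group_limit:
            break  # stop early instead of scanning the rest of the row
        result.extend(members)
    return result

# Wide row with the latest timestamp first (a reversed comparator is assumed).
row = [("t1", "c1"), ("t1", "c2"), ("t1", "c3"), ("t0", "c1"), ("t0", "c2")]
latest = slice_grouped(row, prefix_len=1, group_limit=1)
```

With `group_limit=1` only the `t1` columns come back, without the caller knowing any column names in advance.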

If I'm allowed to go off on a tangent here (I know, not the best place) having 
more pluggable behaviour would be an interesting direction to take with 
Cassandra. Same way it's possible to have custom column comparators, maybe we 
could have pluggable row level indexes, pluggable queries to use them, 
pluggable notification systems, etc. I know this has been discussed before, 
just wanted to add my vote here.

Thanks

 Expose new SliceQueryFilter features through Thrift interface
 -

 Key: CASSANDRA-4989
 URL: https://issues.apache.org/jira/browse/CASSANDRA-4989
 Project: Cassandra
  Issue Type: Improvement
  Components: API
Affects Versions: 1.2.0, 1.2.1, 2.0
Reporter: Cristian Opris

 SliceQueryFilter has some very useful new features like ability to specify a 
 composite column prefix to group by and specify a limit of groups to return.
 This is very useful if for example I have a wide row with columns prefixed by 
 timestamp and I want to retrieve the latest columns, but I don't know the 
 column names. Say I have a row
 {{row -> (t1, c1), (t1, c2)... (t1, cn) ... (t0,c1) ... etc}}
 Query slice range (t1,) group by prefix (1) limit (1)
 As a more general question, is the Thrift interface going to be kept 
 up-to-date with the feature changes or will it be left behind (a mistake IMO) 
 ?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (CASSANDRA-5062) Support CAS

2013-03-05 Thread Cristian Opris (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-5062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13593457#comment-13593457
 ] 

Cristian Opris commented on CASSANDRA-5062:
---

I understand what you mean by terminology but this is not where the confusion 
is coming from.

My commit C1,C2 etc is your learn, agreed. My accept is your commit.

It may be a bit confusing because I'm not detailing everything in the diagram

So when Z goes into C1, that implies: it receives accept from Y, it commits 
(i.e. writes) the value locally
and then it sends learn message to X and Y, which might fail without Z having 
any record of that.

I know this is not the exact behaviour in your algorithm. I'm not sure how the 
leader commits (learns) the value locally, is it because it ends 
up calling receive(LEARN) locally (i.e. acting as acceptor as well) ?

But this doesn't change my point.

*My point is the learn can fail without the leader being aware, which leads to 
a state where each replica is at a different 
stage of learning. Even if the paxos round states are correct in terms of 
accepted values (what you call commit), they are not finished in 
terms of learning*


 Support CAS
 ---

 Key: CASSANDRA-5062
 URL: https://issues.apache.org/jira/browse/CASSANDRA-5062
 Project: Cassandra
  Issue Type: New Feature
  Components: API, Core
Reporter: Jonathan Ellis
 Fix For: 2.0

 Attachments: half-baked commit 1.jpg, half-baked commit 2.jpg, 
 half-baked commit 3.jpg


 Strong consistency is not enough to prevent race conditions.  The classic 
 example is user account creation: we want to ensure usernames are unique, so 
 we only want to signal account creation success if nobody else has created 
 the account yet.  But naive read-then-write allows clients to race and both 
 think they have a green light to create.
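
 The race can be shown with a toy, single-process sketch. The class
 `UsernameRegistry` and its methods are hypothetical stand-ins for a table keyed
 by username; the lock only illustrates what an atomic check-and-write gives the
 caller and says nothing about how a distributed CAS would be implemented.

```python
import threading

class UsernameRegistry:
    """Toy in-memory stand-in for a table keyed by username (hypothetical)."""

    def __init__(self):
        self._owners = {}
        self._lock = threading.Lock()

    def exists(self, username):
        return username in self._owners

    def write(self, username, owner):
        self._owners[username] = owner

    def insert_if_absent(self, username, owner):
        """Check and write as one atomic step; True only for the winner."""
        with self._lock:
            if username in self._owners:
                return False
            self._owners[username] = owner
            return True

    def owner(self, username):
        return self._owners.get(username)

# Naive read-then-write: both clients read before either one writes.
naive = UsernameRegistry()
a_sees_free = not naive.exists("bob")
b_sees_free = not naive.exists("bob")
if a_sees_free:
    naive.write("bob", "client-a")
if b_sees_free:
    naive.write("bob", "client-b")  # silently overwrites client-a
naive_winners = [a_sees_free, b_sees_free]

# Compare-and-set style: only one caller is told it succeeded.
cas = UsernameRegistry()
cas_winners = [cas.insert_if_absent("bob", "client-a"),
               cas.insert_if_absent("bob", "client-b")]
```

 In the naive case both clients believe they created the account; with the
 atomic variant exactly one of them gets the green light.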



[jira] [Commented] (CASSANDRA-5062) Support CAS

2013-03-04 Thread Cristian Opris (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-5062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13592282#comment-13592282
 ] 

Cristian Opris commented on CASSANDRA-5062:
---

FWIW, just to clarify my own examples, I can't speak for Jonathan: *version 
counter or most recent commit is NOT the paxos proposal number*. I've omitted 
the Paxos proposal number from most of my examples, except for the last, more 
detailed one. Timeuuid is fine for proposal number.

Also, with regard to logging/no logging: I believe you only need to keep a log 
if you plan to replicate operations rather than state. 
Transferring state (as we discussed so far) does not require a log but makes it 
impractical to replicate large values, so this is the main trade-off. I don't 
believe it's got anything to do with paxos.



[jira] [Commented] (CASSANDRA-5062) Support CAS

2013-03-04 Thread Cristian Opris (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-5062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13592559#comment-13592559
 ] 

Cristian Opris commented on CASSANDRA-5062:
---

[~slebresne] I have read your pseudo-code, seems pretty much what I was trying 
to describe with the version counter that counts paxos rounds (except I was 
thinking at row level rather than column level)

I noticed however that while the leader's proposal is aborted if it has a stale 
round, the acceptor algorithm does not handle the case when the 
acceptor replica is behind.

Basically in the acceptor algorithm you don't seem to handle the case where 
C_current.timestamp() < R

One way to do that is to nack the proposal indicating it needs to catch up and 
either expect to receive a snapshot from the leader or do a read.

Also note you don't need to send the column values with proposal. If you get 
quorum for the proposal you can perform the CAS locally and just
send the new column value.

Essentially consensus is on the next column value to write, not the CAS. Since 
proposer is guaranteed to be up to date before sending accept, 
it can do the CAS locally. 




[jira] [Comment Edited] (CASSANDRA-5062) Support CAS

2013-03-04 Thread Cristian Opris (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-5062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13592559#comment-13592559
 ] 

Cristian Opris edited comment on CASSANDRA-5062 at 3/4/13 7:48 PM:
---

[~slebresne] I have read your pseudo-code, seems pretty much what I was trying 
to describe with the version counter that counts paxos rounds (except I was 
thinking at row level rather than column level)

I noticed however that while the leader's proposal is aborted if it has a stale 
round, the acceptor algorithm does not handle the case when the 
acceptor replica is behind.

Basically in the acceptor algorithm you don't seem to handle the case where 
C_current.timestamp() < R

One way to do that is to nack the proposal indicating it needs to catch up and 
either expect to receive a snapshot from the leader or do a read.

Also note you don't need to send the column values with the proposal. If you 
get quorum for the proposal you can perform the CAS locally and just
send the new column value with the accept

Essentially consensus is on the next column value to write, not the CAS. Since 
proposer is guaranteed to be up to date before sending accept, 
it can do the CAS locally. 


  was (Author: onetoinfin...@yahoo.com):
[~slebresne] I have read your pseudo-code, seems pretty much what I was 
trying to describe with the version counter that counts paxos rounds (except I 
was thinking at row level rather than column level)

I noticed however that while the leader's proposal is aborted if it has a stale 
round, the acceptor algorithm does not handle the case when the 
acceptor replica is behind.

Basically in the acceptor algorithm you don't seem to handle the case where 
C_current.timestamp() < R

One way to do that is to nack the proposal indicating it needs to catch up and 
either expect to receive a snapshot from the leader or do a read.

Also note you don't need to send the column values with proposal. If you get 
quorum for the proposal you can perform the CAS locally and just
send the new column value.

Essentially consensus is on the next column value to write, not the CAS. Since 
proposer is guaranteed to be up to date before sending accept, 
it can do the CAS locally. 

  


[jira] [Comment Edited] (CASSANDRA-5062) Support CAS

2013-03-04 Thread Cristian Opris (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-5062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13592559#comment-13592559
 ] 

Cristian Opris edited comment on CASSANDRA-5062 at 3/4/13 7:52 PM:
---

[~slebresne] I have read your pseudo-code, seems pretty much what I was trying 
to describe with the version counter that counts paxos rounds (except I was 
thinking at row level rather than column level)

I noticed however that while the leader's proposal is aborted if it has a stale 
round, the acceptor algorithm does not handle the case when the 
acceptor replica is behind.

Basically in the acceptor algorithm you don't seem to handle the case where 
C_current.timestamp() < R-1

Edit: C_current.timestamp needs to be exactly R-1 if you increment the counter 
on sending the proposal.

One way to do that is to nack the proposal indicating it needs to catch up and 
either expect to receive a snapshot from the leader or do a read.

Also note you don't need to send the column values with the proposal. If you 
get quorum for the proposal you can perform the CAS locally and just
send the new column value with the accept

Essentially consensus is on the next column value to write, not the CAS. Since 
proposer is guaranteed to be up to date before sending accept, 
it can do the CAS locally. 


  was (Author: onetoinfin...@yahoo.com):
[~slebresne] I have read your pseudo-code, seems pretty much what I was 
trying to describe with the version counter that counts paxos rounds (except I 
was thinking at row level rather than column level)

I noticed however that while the leader's proposal is aborted if it has a stale 
round, the acceptor algorithm does not handle the case when the 
acceptor replica is behind.

Basically in the acceptor algorithm you don't seem to handle the case where 
C_current.timestamp() < R

One way to do that is to nack the proposal indicating it needs to catch up and 
either expect to receive a snapshot from the leader or do a read.

Also note you don't need to send the column values with the proposal. If you 
get quorum for the proposal you can perform the CAS locally and just
send the new column value with the accept

Essentially consensus is on the next column value to write, not the CAS. Since 
proposer is guaranteed to be up to date before sending accept, 
it can do the CAS locally. 

  


[jira] [Commented] (CASSANDRA-5062) Support CAS

2013-03-04 Thread Cristian Opris (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-5062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13592634#comment-13592634
 ] 

Cristian Opris commented on CASSANDRA-5062:
---

Sorry, I've probably edited my comment after your reply.

C_current.timestamp needs to be exactly R-1 if you increment the counter on 
sending the proposal.

If it's less, then the acceptor hasn't learned the previously committed value 
(R-1) so can't participate in round R, otherwise we're mixing up rounds.

If it's more, then the proposer is behind so you already handle that.
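
The three cases can be written as a small check. This is only a sketch under the assumption stated above (the counter is incremented on sending the proposal); `check_round` and its return strings are made up for illustration and are not taken from the pseudo-code under discussion.

```python
def check_round(acceptor_committed, proposal_round):
    """Classify a proposal for round R against the acceptor's last committed
    round. A valid proposal is assumed to satisfy R == committed + 1."""
    if acceptor_committed == proposal_round - 1:
        return "participate"
    if acceptor_committed < proposal_round - 1:
        # The acceptor has not learned R-1 yet, so it must catch up first.
        return "nack: acceptor behind"
    # The acceptor has already committed R or later.
    return "nack: proposer behind"
```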


Regarding "If you get quorum for the proposal you can perform the CAS locally 
and just send the new column value with the accept":

By that I meant you can do the validate part of the CAS locally, not actually 
write the CAS. 

Basically any operation (not just CAS) can be evaluated (in memory) by the 
proposer after it gets quorum for the proposal (which guarantees it has the 
latest *committed* value) so it obtains the value to send for acceptance. This 
is more of an optimization where you exchange and agree on values rather than 
operations (state transfer replication). Also solves the problem of where to 
validate the CAS.
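
A minimal sketch of that idea, assuming the proposer already holds the latest committed value once it has quorum for the proposal. All names here (`cas`, `increment`, `value_for_accept`) are hypothetical; the point is only that the operation is evaluated on the proposer and a plain value is what gets sent with the accept.

```python
def cas(expected, new_value):
    """Build a compare-and-set operation as a function of the current value."""
    def apply(current):
        # Validation happens here, on the proposer, not on the acceptors.
        if current == expected:
            return (True, new_value)
        return (False, current)
    return apply

def increment(current):
    """Another operation, to show this is not specific to CAS."""
    return (True, current + 1)

def value_for_accept(operation, latest_committed):
    """Return the value to send with the accept, or None if validation failed
    and there is nothing to propose."""
    ok, result = operation(latest_committed)
    return result if ok else None
```

For example, `value_for_accept(cas(1, 2), 1)` yields the new value 2, while the same CAS against a committed value of 5 yields nothing to send.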





[jira] [Comment Edited] (CASSANDRA-5062) Support CAS

2013-03-04 Thread Cristian Opris (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-5062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13592634#comment-13592634
 ] 

Cristian Opris edited comment on CASSANDRA-5062 at 3/4/13 9:05 PM:
---

Sorry, I've probably edited my comment after your reply.

C_current.timestamp needs to be exactly R-1 if you increment the counter on 
sending the proposal.

If it's less, then the acceptor hasn't learned the previously committed value 
(R-1) so can't participate in round R, otherwise we're mixing up rounds.

If it's more, then the proposer is behind so you already handle that.


Regarding "If you get quorum for the proposal you can perform the CAS locally 
and just send the new column value with the accept":

By that I meant you can do the validate part of the CAS locally, not actually 
write the CAS. 

Basically any operation (not just CAS) can be evaluated (in memory) by the 
proposer after it gets quorum for the proposal (which guarantees it has the 
latest *committed* value) so it obtains the value to send for acceptance. This 
is more of an optimization where you exchange and agree on values rather than 
operations (state transfer replication). Also solves the problem of where to 
validate the CAS.



  was (Author: onetoinfin...@yahoo.com):
Sorry, I've probably edited my comment after your reply.

C_current.timestamp needs to be exactly R-1 if you increment the counter on 
sending the proposal.

If it's less than the acceptor hasn't learned the previously committed value 
(R-1) so can't participate in round R, otherwise we're mixing up rounds.

If it's more, than the proposer is behind so you already handle that.


Regarding  If you get quorum for the proposal you can perform the CAS locally 
and just
send the new column value with the accept

By that I meant you can do the validate part of the CAS locally, not actually 
write the CAS. 

Basically any operation (not just CAS) can be evaluated (in memory) by the 
proposal after it gets quorum for the proposal (which guarantees it has the 
latest *committed* value) so it obtains the value to send for acceptance. This 
is more of an optimization where you exchange and agree on values rather than 
operations (state transfer replication). Also solves the problem of where to 
validate the CAS.


  


[jira] [Commented] (CASSANDRA-5062) Support CAS

2013-03-04 Thread Cristian Opris (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-5062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13592743#comment-13592743
 ] 

Cristian Opris commented on CASSANDRA-5062:
---

Say you have this: 

Proposer has committed R-1, starts round R, proposal timestamp Tn

Acceptor recovers with committed R-n < R-1, and has accepted value A at R-n+1 < 
R-1 at Tm in the paxos state log.

When Acceptor receives proposal, if it doesn't check R, if Tm  Tn (clock 
mismatch) according to paxos it needs to send its old accepted value and the 
proposer will have to use it to commit. It will end up committing an old value.

It's an edge case but not impossible. Paxos holds within the same round, but 
not across rounds.

This makes sense because a Paxos round just means agree on a value which once 
accepted by a quorum
can never change.

Which is why you can't have an out of date replica participate in a round.

The idea is to move from quorum that committed (learned) R to quorum that 
accepts R+1 to quorum that commits R+1 and so on. Note the quorums don't need 
to be made of same components.

To ensure this you maintain the invariant that *you can't propose or accept R+1 
locally if you haven't committed R*

So a replica can die and recover, but to recover and participate in paxos needs 
to learn the latest value.

This also gives you consistent read (at the possible cost of an extra read 
paxos proposal to ensure that the last paxos round is committed if left 
ambiguous)





[jira] [Commented] (CASSANDRA-5062) Support CAS

2013-03-04 Thread Cristian Opris (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-5062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13592748#comment-13592748
 ] 

Cristian Opris commented on CASSANDRA-5062:
---

So I think what you're doing at the moment is effectively using the (R,P) tuple 
as the 
proposal number within a single continuous Paxos instance, that sometimes may 
agree on things
and sometimes replicas learn the agreed value.



[jira] [Commented] (CASSANDRA-5062) Support CAS

2013-03-03 Thread Cristian Opris (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-5062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13591778#comment-13591778
 ] 

Cristian Opris commented on CASSANDRA-5062:
---

Jonathan, even if you could rely on monotonically increasing timestamps (which 
is a big assumption), I don't think this will work because it does not clearly 
demarcate between paxos rounds.

So you could have a scenario where you end up with different values committed 
at each replica:

{code}
     R1    R2    R3
1.   C0    C0    C0    //initial state ts=0
2.         P1    P1    //R3 initiates proposal ts=1
3.         A1    A1    //accept ts=1
4.               C1    //R3 has majority, commits ts=1
5.   P2    P2          //R1 initiates proposal ts=2
6.   A2    A2          //accept ts=2; note this breaks Paxos since R1 should have chosen A1
7.   C2                //R1 commits C2
{code}

After step 7, R1=C2, R2=C0, R3=C1

If a read comes in at this point, what would it resolve to? You could say use 
the highest timestamp, but that would require a read ALL.

More importantly, if a CAS request comes in, its validation depends on 
which replica it executes on (unless, again, we do a read ALL first).
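
To see why a quorum read is not enough here, the sketch below enumerates every two-replica quorum over the state after step 7 (R1=C2, R2=C0, R3=C1) and applies "highest timestamp wins". The dictionary layout and the `resolve` helper are assumptions for illustration only.

```python
from itertools import combinations

# Committed timestamp per replica after step 7 of the example above.
committed = {"R1": 2, "R2": 0, "R3": 1}

def resolve(replicas):
    """Highest-timestamp resolution over the replicas that were read."""
    return max(committed[r] for r in replicas)

# One entry per possible quorum of two replicas out of three.
quorum_reads = {q: resolve(q) for q in combinations(sorted(committed), 2)}
read_all = resolve(committed)
```

The quorum (R2, R3) resolves to ts=1 while the other two quorums and the read ALL resolve to ts=2, so the answer depends on which replicas happen to respond.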

The reason I suggested version counters is because this allows a replica to 
detect
it has missed paxos rounds and needs to sync up before proceeding.

The example above modified:

{code}
      R1         R2         R3
1.    C0         C0         C0    //initial state v=0
2.               P1         P1    //R3 initiates proposal v=1
3.               A1         A1    //accept ts=1
4.                          C1    //R3 has majority, commits ts=1
5a.   P1'                         //R1 wants to initiate its own P1
5b.   nack P1'                    //but rejected since already committed
5c.   read C1                     //read and commit C1 (finish round 1)
5d.   P2                          //restarts proposal with v=2
5e.              P2  C1           //R2 receives P2 and notices it's missing C1 which it needs to commit first
6.    A2         A2               //accept v=2; this is ok for Paxos as it's truly a new round
7.    C2                          //R1 commits C2
{code}

After step 7
R1=(C2,A2) R2=(C1,A2) R3=(C1,A1)

The most ambiguous quorum is R2,R3. Let's even assume that R1 has failed.
The ambiguity can still be solved by initiating a new paxos round at version 
v=2 which will necessarily accept and commit A2. (this follows from Paxos)

So to have a consistent read, the read might perform a paxos round to commit A2.

This is a sketch of a proof this is correct:
- if no replica can participate in a paxos round for version V, as acceptor or 
proposer, until it learns and commits locally the previous version V-1
- then for Paxos to achieve a quorum of accept at V, a quorum of replicas must 
have committed V-1
- once a quorum has accepted the same value for V, all replicas can eventually 
learn and commit V by simply rerunning a paxos round at V with value Nil (this 
can be triggered by an attempt to write V+1, or a read as shown above)
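
The gating rule in the first bullet can be sketched as follows. This is a toy model only: the `Replica` class is hypothetical, it tracks just the version counter, and it leaves out proposal numbers within a round, so it is not a Paxos implementation. It replays the modified example above.

```python
class Replica:
    """Toy replica tracking the last committed version (hypothetical)."""

    def __init__(self):
        self.version = 0      # last committed version
        self.value = None     # last committed value
        self.accepted = None  # last (version, value) accepted

    def on_propose(self, version):
        """Promise only for the version right after the one committed locally."""
        if version == self.version + 1:
            return "promise"
        if version <= self.version:
            return "nack: already committed"
        return "nack: behind"

    def on_accept(self, version, value):
        if self.on_propose(version) != "promise":
            return False
        self.accepted = (version, value)
        return True

    def commit(self, version, value):
        """Learn a committed value; versions must be learned in order."""
        if version != self.version + 1:
            return False
        self.version, self.value = version, value
        return True

r1, r2, r3 = Replica(), Replica(), Replica()
# Round 1 driven by R3: R2 and R3 accept, only R3 commits.
r2.on_accept(1, "A1")
r3.on_accept(1, "A1")
r3.commit(1, "A1")
# R1 tries its own round 1, is rejected by R3, then catches up.
first_try = r3.on_propose(1)
r1.commit(1, "A1")
# R1 restarts at version 2; R2 has not committed version 1 yet.
r2_before = r2.on_propose(2)
r2.commit(1, "A1")
r2_after = r2.on_propose(2)
r1.on_accept(2, "A2")
r2.on_accept(2, "A2")
r1.commit(2, "A2")
```

The end state matches the one listed after step 7: R1 has committed version 2, while R2 and R3 have committed version 1 and hold A2 and A1 respectively as accepted values.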



[jira] [Commented] (CASSANDRA-5062) Support CAS

2013-03-03 Thread Cristian Opris (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-5062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13591811#comment-13591811
 ] 

Cristian Opris commented on CASSANDRA-5062:
---

Ok, but I think my point was that even if you can assume that (monotonic time), 
then it's still not correct, because when a proposal with a new value and a 
higher timestamp than the last committed one comes in, accepting it over a 
previously accepted value would violate paxos. That is step 6 in my first 
example there. This at least breaks CAS and cannot give consistent reads.

However, I confess I don't fully understand your solution. Could you summarize 
or formalize it a bit?



[jira] [Commented] (CASSANDRA-5062) Support CAS

2013-03-03 Thread Cristian Opris (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-5062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13591828#comment-13591828
 ] 

Cristian Opris commented on CASSANDRA-5062:
---

One more example of how mostRecentCommit is ambiguous:

{code}
R1                R2      R3
Ct0               Ct0     Ct0           //initial state at t0
                  Atn     Atn           //accept at Tn > t0
                          Atn -> Ctn    //R3 commits Ctn, mostRecentCommit = tn, Accept is cleared !
Atn+m                     Atn+m         //R3 accepts new value at tn+m > tn, this is valid since accept has been cleared
Atn+m -> Ctn+m                          //ambiguous state with R1=Ctn+m, R2=Ct0, R3=Ctn, needs read ALL to resolve
{code}
  



[jira] [Commented] (CASSANDRA-5062) Support CAS

2013-03-03 Thread Cristian Opris (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-5062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13591890#comment-13591890
 ] 

Cristian Opris commented on CASSANDRA-5062:
---

OK, I believe what you're proposing is very close to what I am thinking.

Essentially you're using mostRecentCommit timestamp (mrc) to track the paxos 
instance, while I am proposing to use a sequence value that is incremented on 
local commit.

I expect that in your case as well this epoch number (let's call it that) is 
different from the proposal number, which can indeed be a timestamp (timeuuid).

It seems this epoch doesn't have to be sequential, so a timestamp could work. (I 
would still go with a sequence just to avoid depending on the clock at all, but 
it's not necessary.)

I reworked the example above with more detail, and seems correct:

{code}

R1   R2R3
  Ct0  Ct0   Ct0   //initial state at t0
Ptn(epoch=t0) -   //R3 makes a proposal numbered tn with mRC=t0
 promise(Ptn) --  //R2 promises 
 Atn Atn //accept at tn > t0
  Atn - Ctn   //R3 commits Ctn, mrc=tn, accept is cleared
--- Ptn+m(mrc=t0)//R1 makes a proposal tn+m with mRC=t0, last it knows of
--- nack (Ctn)//R3 rejects since stale mRC; send Ctn directly for R1 to learn
 Ctn
--- Ptn+m(mrc=tn) //propose again at mRC=tn
   
- ok  //R3 promises  
Atn+mAtn+m   //R3 accepts new value at tn+m > tn, this is now valid
Ctn+m
{code}

State:
R1=(Ctn+m), R2=(Ct0,Atn), R3=(Ctn,Atn+m)

Now I think this is pretty much like the variant with version counter above.

A consistent read may have to complete the paxos round for Atn+m, 
but it is guaranteed to resolve to Ctn+m whichever quorum it reads.







[jira] [Comment Edited] (CASSANDRA-5062) Support CAS

2013-03-03 Thread Cristian Opris (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-5062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13591890#comment-13591890
 ] 

Cristian Opris edited comment on CASSANDRA-5062 at 3/3/13 11:09 PM:


OK, I believe what you're proposing is very close to what I have in mind.

Essentially you're using the mostRecentCommit timestamp (mrc) to track the paxos 
instance, while I am proposing a sequence value that is incremented on 
local commit.

I expect that in your case as well this epoch number, let's call it that, is 
distinct from the proposal number, which can indeed be a timestamp (timeuuid).

It seems this epoch doesn't have to be sequential, so a timestamp could work. (I 
would still go with a sequence so as not to depend on the clock at all, but 
it's not necessary.)

I reworked the example above with more detail, and it seems correct:

{code}

R1   R2R3
  Ct0  Ct0   Ct0   //initial state at t0
Ptn(mrc=t0) - //R3 makes a proposal numbered tn with most recent committed t0
 --   ok   -- //R2 promises 
 Atn Atn //accept at tn > t0
  Atn - Ctn   //R3 commits Ctn, mrc=tn, accept is cleared
--- Ptn+m(mrc=t0)//R1 makes a proposal tn+m with mRC=t0, last it knows of
--- nack (Ctn)//R3 rejects since stale mRC; send Ctn directly for R1 to learn
 Ctn
--- Ptn+m(mrc=tn) //propose again at mrc=tn
   
- ok  //R3 promises since mrc up to date
Atn+mAtn+m   //R3 accepts new value at tn+m > tn
Ctn+m
{code}

State:
R1=(Ctn+m), R2=(Ct0,Atn), R3=(Ctn,Atn+m)

Now I think this is pretty much like the variant with version counter above.

A consistent read may have to complete the paxos round for Atn+m, 
but it is guaranteed to resolve to Ctn+m whichever quorum it reads.
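
The message flow above can be sketched roughly in Python. This is only an illustration under assumptions: the `Replica` class, its method names, and the integer stand-ins for t0, tn and tn+m are hypothetical and are not taken from Cassandra. It shows a replica rejecting a proposal that carries a stale mrc and returning its own commit so the proposer can learn it.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Replica:
    """Hypothetical per-row state kept by one replica."""
    committed_ts: int = 0                        # mrc: most recent commit
    committed_value: Optional[str] = None
    promised: int = -1                           # highest proposal promised
    accepted: Optional[Tuple[int, str]] = None   # pending (proposal, value)

    def prepare(self, proposal: int, mrc: int) -> Tuple[bool, int, Optional[str]]:
        # Promise only if the proposer knows our latest commit and the
        # proposal number is newer than anything promised so far.
        # The current commit is always returned so a stale proposer can learn it.
        fresh = mrc >= self.committed_ts and proposal > self.promised
        if fresh:
            self.promised = proposal
        return (fresh, self.committed_ts, self.committed_value)

    def accept(self, proposal: int, value: str) -> bool:
        if proposal < self.promised:
            return False
        self.promised = proposal
        self.accepted = (proposal, value)
        return True

    def commit(self, proposal: int, value: str) -> None:
        # Committing advances mrc and clears an accept that is now settled.
        if proposal > self.committed_ts:
            self.committed_ts, self.committed_value = proposal, value
        if self.accepted is not None and self.accepted[0] <= proposal:
            self.accepted = None


# Replay R3's side of the diagram with t0=0, tn=10, tn+m=20.
T0, TN, TNM = 0, 10, 20
r3 = Replica()
r3.commit(TN, "v1")                 # R3 commits Ctn
stale = r3.prepare(TNM, mrc=T0)     # R1 proposes with mrc=t0: nack plus Ctn
retry = r3.prepare(TNM, mrc=TN)     # R1 proposes again with mrc=tn: ok
r3.accept(TNM, "v2")                # R3 accepts Atn+m
```

The sketch leaves out persistence, networking and the read path that completes a pending accept.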







[jira] [Commented] (CASSANDRA-5062) Support CAS

2013-03-02 Thread Cristian Opris (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-5062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13591568#comment-13591568
 ] 

Cristian Opris commented on CASSANDRA-5062:
---

[~luiscarneiro] What may happen with this is you read a value from the most 
advanced replica and then you try a CAS at a stale replica which will deny it 
even if it's legit, because it does not match its stale value.

I think something like this may work where you track a version counter for each 
row and you make sure you advance paxos rounds (and version counter) one at a 
time per quorum.

Basically the invariant is that a replica initiates or participates in paxos 
round V only after it has committed V-1 locally, which can happen when:
- it learns that a *majority* has *accepted* a value at V-1, so it can *commit* V-1 
locally (i.e. paxos round V-1 is settled)
- it learns that *any* replica has *committed* V-1

I am still fuzzy on how exactly this can be accomplished, but the invariants seem 
good.
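
A minimal sketch of that invariant, assuming a hypothetical `RowState` class that tracks one version counter per row (the names are made up for illustration and say nothing about how Cassandra would implement it):

```python
from dataclasses import dataclass


@dataclass
class RowState:
    """Hypothetical per-row bookkeeping on one replica."""
    committed_version: int = 0   # last version committed locally

    def can_join_round(self, version: int) -> bool:
        # Invariant: take part in round V only once V-1 is committed locally.
        return version == self.committed_version + 1

    def on_majority_accepted(self, version: int, acks: int, replicas: int) -> bool:
        # Case 1: a majority accepted a value at `version`, so the round is
        # settled and it can be committed locally.
        if version == self.committed_version + 1 and acks > replicas // 2:
            self.committed_version = version
            return True
        return False

    def on_peer_committed(self, version: int) -> bool:
        # Case 2: hearing of a commit from any single replica is enough.
        # Catching up across several missed versions is not modelled here.
        if version == self.committed_version + 1:
            self.committed_version = version
            return True
        return False


state = RowState()
first_ok = state.can_join_round(1)
skip_ok = state.can_join_round(2)
minority = state.on_majority_accepted(1, acks=1, replicas=3)
majority = state.on_majority_accepted(1, acks=2, replicas=3)
learned = state.on_peer_committed(2)
```

Only the counter is shown; the accepted and committed values themselves would travel alongside it.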




[jira] [Commented] (CASSANDRA-5062) Support CAS

2013-03-02 Thread Cristian Opris (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-5062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13591570#comment-13591570
 ] 

Cristian Opris commented on CASSANDRA-5062:
---

Note that the version counter is per row, and this would only require keeping 
the last committed and the last accepted values for each row (no log necessary)



[jira] [Commented] (CASSANDRA-5062) Support CAS

2013-03-01 Thread Cristian Opris (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-5062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13590705#comment-13590705
 ] 

Cristian Opris commented on CASSANDRA-5062:
---

This paper might be of interest, Consensus on Transaction Commit:
http://research.microsoft.com/apps/pubs/default.aspx?id=64636

I haven't studied it in detail yet, but it may give some ideas.

Paxos Made Live seems centered on the idea of having a replicated log. I'm not 
sure this applies to what we want to do. There are details on how to do that, 
however, in the papers it cites, the most relevant being, I think:

Lampson, B. W. How to build a highly available system using consensus.
Schneider, F. B. Implementing fault-tolerant services using the state machine 
approach: A tutorial.

Google has links to the papers.




[jira] [Commented] (CASSANDRA-5062) Support CAS

2013-02-28 Thread Cristian Opris (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-5062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13589753#comment-13589753
 ] 

Cristian Opris commented on CASSANDRA-5062:
---

This WILL be exposed to Thrift as well, right ?



[jira] [Issue Comment Deleted] (CASSANDRA-5062) Support CAS

2013-02-28 Thread Cristian Opris (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-5062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cristian Opris updated CASSANDRA-5062:
--

Comment: was deleted

(was: This WILL be exposed to Thrift as well, right ?)



[jira] [Commented] (CASSANDRA-5062) Support CAS

2013-02-28 Thread Cristian Opris (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-5062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13589771#comment-13589771
 ] 

Cristian Opris commented on CASSANDRA-5062:
---

Jonathan, how do you plan to fit CAS into Paxos? Paxos would give consensus, 
but what would the consensus be on? The value to write? 

Is the CAS run at each replica or just at the proposer? How do you make sure that 
when you run CAS locally you have actually learned the *previous* consensus value 
(to compare the expected value with)?



[jira] [Commented] (CASSANDRA-5062) Support CAS

2013-02-28 Thread Cristian Opris (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-5062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13589775#comment-13589775
 ] 

Cristian Opris commented on CASSANDRA-5062:
---

I believe UPDATE statements in SQL return the number of rows affected. You 
could do the same here (for however you define row in CQL)



[jira] [Commented] (CASSANDRA-5062) Support CAS

2013-02-27 Thread Cristian Opris (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-5062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13588316#comment-13588316
 ] 

Cristian Opris commented on CASSANDRA-5062:
---

Zab is not Paxos; it just vaguely resembles it. The Zab leader replicates a totally 
ordered log of idempotent operations to ALL followers. It requires a quorum of 
followers to acknowledge the write before committing on the leader, and then 
commits on the followers. When the leader fails, the new leader is the one that is 
most up-to-date with the writes (highest log sequence number), so it will 
necessarily have all the committed writes. (If it does not have the commit for a 
particular write I believe it can assume it's been committed; I'm a bit unclear 
on this point.)

The new leader needs to fully synchronize all the replicas and establish a 
quorum before writes can resume. That may introduce a small period of 
unavailability.

At least in ZK I believe clients connect to a single replica and may be behind 
the leader with reads, but they will always see all the writes (including their 
own, since they're forwarded to the leader and replicated back) in a consistent order.
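
The quorum-ack rule described above could look roughly like this. It is a toy model, not ZooKeeper code: the `Leader` class and its methods are hypothetical, and counting the leader's own ack towards the quorum is an assumption of the sketch.

```python
class Leader:
    """Toy leader that commits log entries in order once a quorum has acked."""

    def __init__(self, followers: int):
        self.cluster_size = followers + 1   # followers plus the leader itself
        self.next_seq = 1
        self.acks = {}                      # seq -> ack count (leader included)
        self.committed = []                 # committed sequence numbers, in order

    def quorum(self) -> int:
        return self.cluster_size // 2 + 1

    def propose(self) -> int:
        # Assign the next log sequence number; the leader acks its own entry.
        seq = self.next_seq
        self.next_seq += 1
        self.acks[seq] = 1
        return seq

    def on_ack(self, seq: int) -> None:
        self.acks[seq] += 1
        # Commit strictly in log order: an entry waits for all earlier ones.
        nxt = len(self.committed) + 1
        while self.acks.get(nxt, 0) >= self.quorum():
            self.committed.append(nxt)
            nxt += 1


leader = Leader(followers=4)    # 5 replicas, quorum of 3
a = leader.propose()
b = leader.propose()
leader.on_ack(b)
leader.on_ack(b)                # b has a quorum, but a is still pending
before = list(leader.committed)
leader.on_ack(a)
leader.on_ack(a)                # a reaches quorum, so a and b commit in order
after = list(leader.committed)
```

Committing in sequence order is what keeps the log totally ordered even when acks arrive out of order.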



[jira] [Commented] (CASSANDRA-5062) Support CAS

2013-02-27 Thread Cristian Opris (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-5062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13588328#comment-13588328
 ] 

Cristian Opris commented on CASSANDRA-5062:
---

On the other hand Paxos for each CAS would be quite different. 

The basic approach would be to have each CAS be a full Paxos round (Phase 1: 
prepare/promise, Phase 2: propose/accept). In this case each round is 
independent and writes can happen concurrently (as opposed to Zab where all 
writes are applied serially cluster-wide).
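
As a rough illustration of one such round, here is a single-decree Paxos sketch. The `Acceptor` class and `run_round` function are hypothetical names for this example, everything runs in one process with no failures, and no CAS check is included.

```python
from typing import List, Optional, Tuple


class Acceptor:
    def __init__(self) -> None:
        self.promised = -1
        self.accepted: Optional[Tuple[int, str]] = None

    def prepare(self, n: int) -> Tuple[bool, Optional[Tuple[int, str]]]:
        # Phase 1: promise not to accept anything numbered below n.
        if n > self.promised:
            self.promised = n
            return True, self.accepted
        return False, None

    def accept(self, n: int, value: str) -> bool:
        # Phase 2: accept unless a higher-numbered promise was made meanwhile.
        if n >= self.promised:
            self.promised = n
            self.accepted = (n, value)
            return True
        return False


def run_round(acceptors: List[Acceptor], n: int, value: str) -> Optional[str]:
    """Run one full round; return the chosen value, or None if it failed."""
    majority = len(acceptors) // 2 + 1
    replies = [a.prepare(n) for a in acceptors]
    granted = [prev for ok, prev in replies if ok]
    if len(granted) < majority:
        return None
    prior = [p for p in granted if p is not None]
    if prior:
        # A value may already be chosen: re-propose the highest-numbered one.
        value = max(prior, key=lambda p: p[0])[1]
    acks = sum(a.accept(n, value) for a in acceptors)
    return value if acks >= majority else None


cluster = [Acceptor() for _ in range(3)]
first = run_round(cluster, 1, "alice")
second = run_round(cluster, 2, "bob")     # must adopt the already-chosen value
stale = run_round(cluster, 1, "carol")    # proposal number too old
```

Each call costs two round trips to the acceptors, which is the overhead that the optimisations mentioned next try to reduce.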

There doesn't even need to be a leader; that is an optimisation to ensure 
liveness (avoid duelling proposers). 

Now since full Paxos is quite expensive in terms of roundtrips, there are 
optimisations to reduce that (see Fast Paxos in the Wikipedia article) but I 
have yet to study the details of that.

There is also the question of how the actual CAS op would be integrated with 
Paxos. Who does the CAS? Presumably the proposer needs to be able to do the 
CAS verify locally, or maybe acceptors can NACK if the CAS is rejected locally. 
Would that be a valid nack in Paxos terms? That can be sorted out.





[jira] [Commented] (CASSANDRA-5062) Support CAS

2013-02-27 Thread Cristian Opris (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-5062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13588332#comment-13588332
 ] 

Cristian Opris commented on CASSANDRA-5062:
---

Re. which storage to use for metadata, why not use a meta-column family, like 
for secondary indexes, or like the locks would have required ? 

For Zab a persistent log will be necessary, and for Paxos a way to persist the 
paxos round state for each row.



[jira] [Commented] (CASSANDRA-5062) Support CAS

2013-02-27 Thread Cristian Opris (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-5062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13588521#comment-13588521
 ] 

Cristian Opris commented on CASSANDRA-5062:
---

The Zab paper: research.yahoo.com/files/ladis08.pdf



[jira] [Comment Edited] (CASSANDRA-5062) Support CAS

2013-02-27 Thread Cristian Opris (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-5062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13588521#comment-13588521
 ] 

Cristian Opris edited comment on CASSANDRA-5062 at 2/27/13 5:12 PM:


The Zab paper: http://research.yahoo.com/files/ladis08.pdf



[jira] [Commented] (CASSANDRA-5062) Support CAS

2013-02-27 Thread Cristian Opris (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-5062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13588542#comment-13588542
 ] 

Cristian Opris commented on CASSANDRA-5062:
---

In the Zab paper in 4.1 it says ??We are able to simplify the two-phase commit 
protocol because we do not have aborts; followers either acknowledge the 
leader’s proposal or they abandon the leader. *The lack of aborts also mean 
that we can commit once a quorum of servers ack the proposal rather than 
waiting for all servers to respond.* This simplified two-phase commit by 
itself cannot handle leader failures, so we will add recovery mode to handle 
leader failures.??

So basically once a proposal is acked by a quorum there is no going back (no 
abort). The leader has to succeed in committing that or else it will lose its 
leadership.

If the client times out in the meantime it has to retry and find out what the 
result was. Presumably this can happen with regular ACID databases as well, 
where a client sends COMMIT TX and times out immediately after that.



[jira] [Comment Edited] (CASSANDRA-5062) Support CAS

2013-02-27 Thread Cristian Opris (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-5062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13588542#comment-13588542
 ] 

Cristian Opris edited comment on CASSANDRA-5062 at 2/27/13 5:32 PM:


In the Zab paper in 4.1 it says ??We are able to simplify the two-phase commit 
protocol because we do not have aborts; followers either acknowledge the 
leader’s proposal or they abandon the leader. *The lack of aborts also mean 
that we can commit once a quorum of servers ack the proposal rather than 
waiting for all servers to respond.* This simplified two-phase commit by 
itself cannot handle leader failures, so we will add recovery mode to handle 
leader failures.??

So basically once a proposal is acked by a quorum there is no going back (no 
abort). The leader has to succeed in committing that or else it will lose its 
leadership.

If the client times out in the meantime it has to retry and find out what the 
result was. Presumably this can happen with regular ACID databases as well, 
where a client sends COMMIT TX and times out immediately after that.



[jira] [Commented] (CASSANDRA-5062) Support CAS

2013-02-27 Thread Cristian Opris (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-5062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13588567#comment-13588567
 ] 

Cristian Opris commented on CASSANDRA-5062:
---

Note that a proposal may eventually succeed on recovery even if less than a 
quorum has managed to ack it before the leader fails (and the client timed 
out). The need for quorum writes is to be able to survive F failures out of 
2F+1 replicas. Reads are not quorum, just replica-local reads.

Let's say we have 5 replicas with F1 as leader; F4 and F5 are ignored here as they 
don't matter.
{{
1a. F1 -> proposal -> F2
1b. F1 <- ack <- F2
2a. F1 -> proposal -> F3
2b. F1 <- ack <- F3
3a. F1 -> OK -> client
3b. F1 -> COMMIT -> F2,F3
}}

If F1 fails immediately after step 1b, F2 would become the leader since it has 
the latest seq number. Now only F2 has the proposal, but it can continue and 
commit it to the other followers.
If it can't get a quorum (maybe it's partitioned in a minority) then it gives 
up leadership. When it rejoins the majority, it runs another recovery procedure 
that uses epoch numbers to determine if it needs to throw away that proposal. 
This is fine since no client has actually received confirmation that the proposal 
has been committed. This is detailed in the paper.
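
The recovery choice in this scenario can be sketched as below. The helper names and the `(epoch, seq)` tuples are assumptions made for illustration; the actual Zab recovery procedure is the one described in the paper.

```python
from typing import Dict, Tuple


def elect_leader(last_seen: Dict[str, Tuple[int, int]]) -> str:
    """Pick the live replica with the highest (epoch, seq) pair in its log."""
    return max(last_seen, key=lambda name: last_seen[name])


def can_lead(reachable: int, cluster_size: int) -> bool:
    """A leader needs a strict majority, counting itself, to keep leading."""
    return reachable > cluster_size // 2


# F1 has failed right after step 1b: only F2 holds the new proposal (seq 5).
survivors = {"F2": (1, 5), "F3": (1, 4), "F4": (1, 4), "F5": (1, 4)}
new_leader = elect_leader(survivors)

# With 4 of 5 replicas reachable F2 can finish the commit; if it were
# partitioned with just one other replica it would have to step down.
healthy = can_lead(reachable=4, cluster_size=5)
partitioned = can_lead(reachable=2, cluster_size=5)
```

Comparing the epoch before the sequence number is what would let a rejoining replica notice that its uncommitted proposal belongs to an older epoch.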




[jira] [Comment Edited] (CASSANDRA-5062) Support CAS

2013-02-27 Thread Cristian Opris (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-5062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13588567#comment-13588567
 ] 

Cristian Opris edited comment on CASSANDRA-5062 at 2/27/13 6:09 PM:


Note that a proposal may eventually succeed on recovery even if a less than a 
quorum has managed to ack it before the leader fails (and the client timed 
out). The need for quorum writes is to be able to survive F failures out of 
2F+1 replicas. Reads are not quorum, just replica local reads.

Let's say we have 5 replicas, F1 leader, F4 and F5 are ignored here as they 
don't matter
{code}
1a F1 - proposal - F2
1b F1 -  ack - F2
2a F1 - proposal - F3
2b F1 -  ack - F3
3a F1 -  OK  - client
3b F1 - COMMIT   - F2,F3
{code}

If F1 fails immediately after step 1b, F2 would become the leader since it has 
the latest sequence number. Now only F2 has the proposal but it can continue and 
commit it to the other followers.
If it can't get a quorum (maybe it's partitioned in a minority) then it gives 
up leadership. When it rejoins the majority, it runs another recovery procedure 
that uses epoch numbers to determine if it needs to throw away that proposal. 
This is fine since no client has actually received confirmation that the 
proposal was committed. This is detailed in the paper.
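
The quorum arithmetic behind the 5-replica example can be sketched as follows. This is only an illustration: the function names are made up here, and counting the leader's own copy as one ack is an assumption of the sketch, not something taken from the paper.

```python
def quorum_size(replicas):
    """Smallest majority of `replicas` nodes."""
    return replicas // 2 + 1


def tolerated_failures(replicas):
    """Largest F such that a majority still survives F failures (replicas = 2F+1)."""
    return (replicas - 1) // 2


def is_committed(acks, replicas):
    """A proposal counts as committed once a majority holds it.

    `acks` includes the leader's own copy (assumption of this sketch).
    """
    return acks >= quorum_size(replicas)


# The 5-replica example: F1 is the leader, F2 and F3 ack in turn.
after_step_1b = is_committed(acks=2, replicas=5)  # F1 + F2 only
after_step_2b = is_committed(acks=3, replicas=5)  # F1 + F2 + F3
```

With 5 replicas the majority is 3, so a failure right after step 1b leaves the proposal short of a quorum, which is why no client can have been told OK at that point.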


  was (Author: onetoinfin...@yahoo.com):
Note that a proposal may eventually succeed on recovery even if a less than 
a quorum has managed to ack it before the leader fails (and the client timed 
out). The need for quorum writes is to be able to survive F failures out of 
2F+1 replicas. Reads are not quorum, just replica local reads.

Let's say we have 5 replicas, F1 leader, F4 and F5 are ignored here as they 
don't matter
{{
1a. F1 - proposal - F2
1b. F1 -  ack - F2
2a. F1 - proposal - F3
2b. F1 -  ack - F3
3a F1 -  OK  - client
3b F1 - COMMIT   - F2,F3
}}

If F1 fails immediately after step 1b, F2 would become the leader since he has 
the latest seq number. Now only F2 has the proposal but it can continue and 
commit it to the other followers.
If it can't get a quorum (maybe it's partitioned in a minority) then it gives 
up leadership. When it rejoins the majority, it runs another recovery procedure 
that uses epoch numbers to determine if it needs to throw away that proposal. 
This is fine since no client has actually been confirmed that the proposal has 
been committed. This is detailed in the paper.

  
 Support CAS
 ---

 Key: CASSANDRA-5062
 URL: https://issues.apache.org/jira/browse/CASSANDRA-5062
 Project: Cassandra
  Issue Type: New Feature
  Components: API, Core
Reporter: Jonathan Ellis
 Fix For: 2.0


 Strong consistency is not enough to prevent race conditions.  The classic 
 example is user account creation: we want to ensure usernames are unique, so 
 we only want to signal account creation success if nobody else has created 
 the account yet.  But naive read-then-write allows clients to race and both 
 think they have a green light to create.
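
A minimal sketch of the race described above, assuming an in-memory dict stands in for the accounts table. The names `naive_read`, `naive_write` and `cas_create` are hypothetical and are not Cassandra APIs; the two clients are interleaved by hand so the outcome is deterministic.

```python
import threading

accounts = {}             # stands in for the accounts table (assumption)
_lock = threading.Lock()  # makes check-and-insert a single atomic step


def naive_read(username):
    """Step 1 of read-then-write: is the name free?"""
    return username not in accounts


def naive_write(username, owner):
    """Step 2 of read-then-write: unconditional insert."""
    accounts[username] = owner
    return True  # the client is told it succeeded


def cas_create(username, owner):
    """Insert only if absent; the check and the write cannot be separated."""
    with _lock:
        if username in accounts:
            return False
        accounts[username] = owner
        return True


# Both clients read before either one writes.
a_free = naive_read("alice")
b_free = naive_read("alice")
naive_results = [a_free and naive_write("alice", "client-A"),
                 b_free and naive_write("alice", "client-B")]

# Same two clients, but using the conditional insert.
accounts.clear()
cas_results = [cas_create("alice", "client-A"),
               cas_create("alice", "client-B")]
```

In the naive run both clients are told the account was created; with the conditional insert only the first one is.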



[jira] [Comment Edited] (CASSANDRA-5062) Support CAS

2013-02-26 Thread Cristian Opris (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-5062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13586371#comment-13586371
 ] 

Cristian Opris edited comment on CASSANDRA-5062 at 2/26/13 6:23 PM:


Afaict from the Spinnaker paper they only require ZK for fault tolerant leader 
election, failure detection and possibly cluster membership. (The right lower 
box in the diagram in 4.1) The rest of it is their actual data storage engine.

A few more comments:

1. Paxos can be made very efficient particularly in stable operation scenarios. 
-I believe Zab devolves effectively in atomic broadcast (not even 2PC) with a 
stable leader. So you can normally do writes with a single roundtrip just like 
now.-
Edit: Zab requires 4 delays (2 roundtrips) actually

2. There is a difference between what I described above and what Spinnaker 
does. I believe they elect a leader for the entire replica group while my 
description assumes 1 full paxos instance per row write. I'm not fully clear 
atm how this would work but I believe even that can be optimized to single 
roundtrips per write in normal operation (I believe it's in one of Google's 
papers that they piggyback the commit on the next proposal for example) 

Off the top of my head: coordinator assumes one of the replicas as being most 
up-to-date, attempts to use it as leader. Replica starts Paxos round attaching 
the write payload. If accepted on a majority replica can send commit. 
Opportunistically attaches further proposals to it. If Paxos round fails (or a 
number of rounds fail) it's likely the replica is behind on many rows so 
coordinator switches to another replica.

Now this is all preliminary as I haven't fully thought this through but I think 
it's definitely worth investigating. While it may be a complicated protocol it 
has significant performance advantages over locks. Just count how many 
roundtrips you'd need in the wait chain algorithm. Not to mention handling 
expired/orphan locks.



  was (Author: onetoinfin...@yahoo.com):
Afaict from the Spinnaker paper they only require ZK for fault tolerant 
leader election, failure detection and possibly cluster membership. (The right 
lower box in the diagram in 4.1) The rest of it is their actual data storage 
engine.

A few more comments:

1. Paxos can be made very efficient particularly in stable operation scenarios. 
I believe Zab devolves effectively in atomic broadcast (not even 2PC) with a 
stable leader. So you can normally do writes with a single roundtrip just like 
now. 

2. There is a difference between what I described above and what Spinnaker 
does. I believe they elect a leader for the entire replica group while my 
description assumes 1 full paxos instance per row write. I'm not fully clear 
atm how this would work but I believe even that can be optimized to single 
roundtrips per write in normal operation (I believe it's in one of Google's 
papers that they piggyback the commit on the next proposal for example) 

Off the top of my head: coordinator assumes one of the replicas as being most 
up-to-date, attempts to use it as leader. Replica starts Paxos round attaching 
the write payload. If accepted on a majority replica can send commit. 
Opportunistically attaches further proposals to it. If Paxos round fails (or a 
number of rounds fail) it's likely the replica is behind on many rows so 
coordinator switches to another replica.

Now this is all preliminary as I haven't fully thought this through but I think 
it's definitely worth investigating. While it may be a complicated protocol it 
has significant performance advantages over locks. Just count how many 
roundtrips you'd need in the wait chain algorithm. Not to mentioned handling 
expired/orphan locks


  



[jira] [Comment Edited] (CASSANDRA-5062) Support CAS

2013-02-26 Thread Cristian Opris (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-5062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13586394#comment-13586394
 ] 

Cristian Opris edited comment on CASSANDRA-5062 at 2/26/13 6:24 PM:


So I guess what I'm proposing is similar to what Piotr said above: each CAS is 
a round of Paxos.
With some cleverness this can be collapsed to Multi-Paxos. 
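
To make "each CAS is a round of Paxos" concrete, here is a textbook single-decree Paxos round in memory. It is a sketch under simplifying assumptions (no networking, no failures, ballots chosen by the caller), and the class and function names are invented for illustration; it is not the design proposed for this ticket.

```python
class Acceptor:
    def __init__(self):
        self.promised = 0     # highest ballot this acceptor has promised
        self.accepted = None  # (ballot, value) last accepted, or None

    def prepare(self, ballot):
        """Phase 1: promise to ignore lower ballots, report any accepted value."""
        if ballot > self.promised:
            self.promised = ballot
            return True, self.accepted
        return False, None

    def accept(self, ballot, value):
        """Phase 2: accept unless a higher ballot has been promised."""
        if ballot >= self.promised:
            self.promised = ballot
            self.accepted = (ballot, value)
            return True
        return False


def propose(acceptors, ballot, value):
    """Run one Paxos round; return the chosen value, or None if the round fails."""
    majority = len(acceptors) // 2 + 1
    replies = [a.prepare(ballot) for a in acceptors]
    promises = [accepted for ok, accepted in replies if ok]
    if len(promises) < majority:
        return None
    # A proposer must adopt the highest-ballot value already accepted, if any.
    prior = [p for p in promises if p is not None]
    if prior:
        value = max(prior, key=lambda bv: bv[0])[1]
    acks = sum(a.accept(ballot, value) for a in acceptors)
    return value if acks >= majority else None


replicas = [Acceptor() for _ in range(3)]
first = propose(replicas, 1, "create alice for A")
second = propose(replicas, 2, "create alice for B")  # learns A already won
stale = propose(replicas, 1, "create alice for C")   # ballot too old
```

The second proposer does not overwrite the first: it discovers the already chosen value, which is the signal a CAS needs to report failure to its client.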
 
Spinnaker does leader election with ZK precisely because they did not want to 
implement Paxos themselves. 

From the paper, section 5: The replication protocol has two phases: a leader 
election phase, followed by a quorum phase where the leader proposes a write 
and the followers accept it.

-That is Multi-Paxos, with first phase (leader election) handled by ZK and 
second phase being the steady state (propose/accept) with the actual 
write/commit-
Edit: it's not, see below

  was (Author: onetoinfin...@yahoo.com):
So I guess what I'm proposing is similar to what Piotr said above: each CAS 
is a round of Paxos.
With some cleverness this can be collapsed to Multi-Paxos. 
 

Spinnaker does leader election with ZK precisely because they did not want to 
implement Paxos themselves. 

From the paper, section 5: The replication protocol has two phases: a leader 
election phase, followed by a quorum phase where the leader proposes a write 
and the followers accept it.

That is Multi-Paxos, with first phase (leader election) handled by ZK and 
second phase being the steady state (propose/accept) with the actual 
write/commit
  



[jira] [Commented] (CASSANDRA-5062) Support CAS

2013-02-26 Thread Cristian Opris (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-5062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13587399#comment-13587399
 ] 

Cristian Opris commented on CASSANDRA-5062:
---

Rereading the papers, Spinnaker, Zab and Sergio's option 3) are pretty much 
the same thing.

The alternative is to do a true Paxos instance for each CAS round, but it's not 
clear how that can be done efficiently (and simply).








[jira] [Commented] (CASSANDRA-5062) Support CAS

2013-02-25 Thread Cristian Opris (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-5062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13586290#comment-13586290
 ] 

Cristian Opris commented on CASSANDRA-5062:
---

This shouldn't be too complicated with Paxos leader election very similar to 
Spinnaker

I don't think it requires changing the read/write paths at the lower level, at 
least not significantly.

Assume for the sake of simplicity that we use a column prefix to encode the 
version

The leader elected should always be the one that has the latest version.

This allows the leader to perform read-modify-write (conditional update) 
locally and do a simple quorum write to propagate that if successful.

The leader can also increment the version sequentially.

Conflicting writes from other replicas cannot succeed because any node that 
wants to write needs to get itself elected leader first.

Since we do quorum writes not all replicas will have the full sequence of 
versions but regular anti-entropy (read-repair) on quorum reads should take 
care of that.
  
If the leader fails the newly elected leader necessarily will be the one that 
has the latest write so it can continue to do cas locally.

Anti-entropy should also take care of recovery and catch-up of a replica just 
like now.

I believe this can all be done on top of existing functionality without major 
changes to read/write paths

You could also reuse the Zab algorithm from ZK for expediency without having 
to use the entire ZK codebase.







[jira] [Comment Edited] (CASSANDRA-5062) Support CAS

2013-02-25 Thread Cristian Opris (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-5062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13586290#comment-13586290
 ] 

Cristian Opris edited comment on CASSANDRA-5062 at 2/25/13 9:14 PM:


This shouldn't be too complicated with Paxos leader election very similar to 
Spinnaker

I don't think it requires changing the read/write paths at the lower level, at 
least not significantly.

Assume for the sake of simplicity that we use a column prefix to encode the 
version

The leader elected should always be the one that has the latest version.

This allows the leader to perform read-modify-write (conditional update) 
locally and do a simple quorum write to propagate that if successful.

The leader can also increment the version sequentially.

Conflicting writes from other replicas cannot succeed because any node that 
wants to write needs to get itself elected leader first.

Since we do quorum writes not all replicas will have the full sequence of 
versions but regular anti-entropy (read-repair) on quorum reads should take 
care of that.
  
If the leader fails the newly elected leader necessarily will be the one that 
has the latest write so it can continue to do cas locally.

Anti-entropy should also take care of recovery and catch-up of a replica just 
like now.

I believe this can all be done on top of existing functionality without major 
changes to read/write paths

You could also reuse the Zab algorithm from ZK for expediency without having 
to use the entire ZK codebase.




  was (Author: onetoinfin...@yahoo.com):
This shouldn't be too complicated with Paxos leader election very similar 
to Spinnaker

I don't think it requires changing the read/write paths at the lower level, at 
least not significantly.

Assume for the sake of simplicity that we use a column prefix to encode the 
version

The leader elected should always be the one that has the latest version.

This allows the leader to perform read-modify-write (conditional update) 
locally and do a simple quorum write to propagate that if successful.

The leader can also increment the version sequentially.

Conflicting writes from other replicas cannot succeed because any node that 
wants to write needs to get itself elected reader first.

Since we do quorum writes not all replicas will have the full sequence of 
versions but regular anti-entropy (read-repair) on quorum reads should take 
care of that.
  
If the leader fails the newly elected leader necessarily will be the one that 
has the latest write so it can continue to do cas locally.

Anti-entropy should also take care of recovery and catch-up of a replica just 
like now.

I believe this can all be done on top of existing functionality without major 
changes to read/write paths

You could also reuse the Zab algorithm from ZK for expediency without using 
having to use the entire 
ZK codebase.



  



[jira] [Comment Edited] (CASSANDRA-5062) Support CAS

2013-02-25 Thread Cristian Opris (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-5062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13586290#comment-13586290
 ] 

Cristian Opris edited comment on CASSANDRA-5062 at 2/25/13 9:16 PM:


This shouldn't be too complicated with Paxos leader election very similar to 
Spinnaker

I don't think it requires changing the read/write paths at the lower level, at 
least not significantly.

Assume for the sake of simplicity that we use a column prefix to encode the 
version

The leader elected should always be the one that has the latest version.

This allows the leader to perform read-modify-write (conditional update) 
locally and do a simple quorum write to propagate that if successful.

The leader can also increment the version sequentially.
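
The leader-side conditional update described so far can be sketched like this. Everything here is hypothetical (the `Replica` class, `elect_leader`, `cas`): the election is reduced to "pick the live replica with the highest version", and a failed quorum write is not rolled back, so recovery is out of scope.

```python
class Replica:
    def __init__(self, name):
        self.name = name
        self.version = 0   # stands in for the version encoded in the column prefix
        self.value = None
        self.up = True

    def apply(self, version, value):
        """Ack a replicated write; keep it only if it is newer than what we hold."""
        if not self.up:
            return False
        if version > self.version:
            self.version, self.value = version, value
        return True


def elect_leader(replicas):
    """Pick the live replica with the latest version (real election omitted)."""
    live = [r for r in replicas if r.up]
    return max(live, key=lambda r: r.version)


def cas(leader, replicas, expected, new_value):
    """Check the condition locally on the leader, then do a quorum write."""
    if leader.value != expected:
        return False
    version = leader.version + 1  # leader increments the version sequentially
    acks = sum(r.apply(version, new_value) for r in replicas)
    return acks >= len(replicas) // 2 + 1


replicas = [Replica("F1"), Replica("F2"), Replica("F3")]
leader = elect_leader(replicas)
first = cas(leader, replicas, None, "alice@A")    # name was free
second = cas(leader, replicas, None, "alice@B")   # condition no longer holds

replicas[0].up = False                            # the leader fails
new_leader = elect_leader(replicas)               # has the latest write
third = cas(new_leader, replicas, "alice@A", "alice@C")
```

Because the check runs on the one node guaranteed to hold the latest version, no extra read round is needed before the quorum write.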

Conflicting writes from other replicas cannot succeed because any node that 
wants to write needs to get itself elected leader first.

Since we do quorum writes not all replicas will have the full sequence of 
versions but regular anti-entropy (read-repair) on quorum reads should take 
care of that.
  
If the leader fails the newly elected leader necessarily will be the one that 
has the latest write so it can continue to do cas locally.

Anti-entropy should also take care of recovery and catch-up of a replica just 
like now.

I believe this can all be done on top of existing functionality without major 
changes to read/write paths

You could also reuse the Zab algorithm from ZK for expediency without depending 
on the entire
ZK codebase.




  was (Author: onetoinfin...@yahoo.com):
This shouldn't be too complicated with Paxos leader election very similar 
to Spinnaker

I don't think it requires changing the read/write paths at the lower level, at 
least not significantly.

Assume for the sake of simplicity that we use a column prefix to encode the 
version

The leader elected should always be the one that has the latest version.

This allows the leader to perform read-modify-write (conditional update) 
locally and do a simple quorum write to propagate that if successful.

The leader can also increment the version sequentially.

Conflicting writes from other replicas cannot succeed because any node that 
wants to write needs to get itself elected leader first.

Since we do quorum writes not all replicas will have the full sequence of 
versions but regular anti-entropy (read-repair) on quorum reads should take 
care of that.
  
If the leader fails the newly elected leader necessarily will be the one that 
has the latest write so it can continue to do cas locally.

Anti-entropy should also take care of recovery and catch-up of a replica just 
like now.

I believe this can all be done on top of existing functionality without major 
changes to read/write paths

You could also reuse the Zab algorithm from ZK for expediency without using 
having to use the entire 
ZK codebase.



  



[jira] [Commented] (CASSANDRA-5062) Support CAS

2013-02-25 Thread Cristian Opris (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-5062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13586371#comment-13586371
 ] 

Cristian Opris commented on CASSANDRA-5062:
---

Afaict from the Spinnaker paper they only require ZK for fault tolerant leader 
election, failure detection and possibly cluster membership. (The right lower 
box in the diagram in 4.1) The rest of it is their actual data storage engine.

A few more comments:

1. Paxos can be made very efficient particularly in stable operation scenarios. 
I believe Zab devolves effectively in atomic broadcast (not even 2PC) with a 
stable leader. So you can normally do writes with a single roundtrip just like 
now. 

2. There is a difference between what I described above and what Spinnaker 
does. I believe they elect a leader for the entire replica group while my 
description assumes 1 full paxos instance per row write. I'm not fully clear 
atm how this would work but I believe even that can be optimized to single 
roundtrips per write in normal operation (I believe it's in one of Google's 
papers that they piggyback the commit on the next proposal for example) 

Off the top of my head: coordinator assumes one of the replicas as being most 
up-to-date, attempts to use it as leader. Replica starts Paxos round attaching 
the write payload. If accepted on a majority replica can send commit. 
Opportunistically attaches further proposals to it. If Paxos round fails (or a 
number of rounds fail) it's likely the replica is behind on many rows so 
coordinator switches to another replica.

Now this is all preliminary as I haven't fully thought this through but I think 
it's definitely worth investigating. While it may be a complicated protocol it 
has significant performance advantages over locks. Just count how many 
roundtrips you'd need in the wait chain algorithm. Not to mention handling 
expired/orphan locks.






[jira] [Comment Edited] (CASSANDRA-5062) Support CAS

2013-02-25 Thread Cristian Opris (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-5062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13586371#comment-13586371
 ] 

Cristian Opris edited comment on CASSANDRA-5062 at 2/25/13 10:21 PM:
-

Afaict from the Spinnaker paper they only require ZK for fault tolerant leader 
election, failure detection and possibly cluster membership. (The right lower 
box in the diagram in 4.1) The rest of it is their actual data storage engine.

A few more comments:

1. Paxos can be made very efficient particularly in stable operation scenarios. 
I believe Zab devolves effectively in atomic broadcast (not even 2PC) with a 
stable leader. So you can normally do writes with a single roundtrip just like 
now. 

2. There is a difference between what I described above and what Spinnaker 
does. I believe they elect a leader for the entire replica group while my 
description assumes 1 full paxos instance per row write. I'm not fully clear 
atm how this would work but I believe even that can be optimized to single 
roundtrips per write in normal operation (I believe it's in one of Google's 
papers that they piggyback the commit on the next proposal for example) 

Off the top of my head: coordinator assumes one of the replicas as being most 
up-to-date, attempts to use it as leader. Replica starts Paxos round attaching 
the write payload. If accepted on a majority replica can send commit. 
Opportunistically attaches further proposals to it. If Paxos round fails (or a 
number of rounds fail) it's likely the replica is behind on many rows so 
coordinator switches to another replica.

Now this is all preliminary as I haven't fully thought this through but I think 
it's definitely worth investigating. While it may be a complicated protocol it 
has significant performance advantages over locks. Just count how many 
roundtrips you'd need in the wait chain algorithm. Not to mention handling 
expired/orphan locks.



  was (Author: onetoinfin...@yahoo.com):
Afaict from the Spinnaker paper they only require ZK for fault tolerant 
leader election, failure detection and possibly cluster membership. (The right 
lower box in the diagram in 4.1) The rest of it their actual data storage 
engine.

A few more comments:

1. Paxos can be made very efficient particularly in stable operation scenarios. 
I believe Zab devolves effectively in atomic broadcast (not even 2PC) with a 
stable leader. So you can normally do writes with a single roundtrip just like 
now. 

2. There is a difference between what I described above and what Spinnaker 
does. I believe they elect a leader for the entire replica group while my 
description assumes 1 full paxos instance per row write. I'm not fully clear 
atm how this would work but I believe even that can be optimized to single 
roundtrips per write in normal operation (I believe it's in one of Google's 
papers that they piggyback the commit on the next proposal for example) 

Off the top of my head: coordinator assumes one of the replicas as being most 
up-to-date, attempts to use it as leader. Replica starts Paxos round attaching 
the write payload. If accepted on a majority replica can send commit. 
Opportunistically attaches further proposals to it. If Paxos round fails (or a 
number of rounds fail) it's likely the replica is behind on many rows so 
coordinator switches to another replica.

Now this is all preliminary as I haven't fully thought this through but I think 
it's definitely worth investigating. While it may be a complicated protocol it 
has significan performance advantages over locks. Just count how many 
roundtrips you'd need in the wait chain algorithm. Not to mentioned handling 
expired/orphan locks


  



[jira] [Commented] (CASSANDRA-5062) Support CAS

2013-02-25 Thread Cristian Opris (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-5062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13586376#comment-13586376
 ] 

Cristian Opris commented on CASSANDRA-5062:
---

See Multi-Paxos in the wikipedia article: 
http://en.wikipedia.org/wiki/Paxos_%28computer_science%29#Multi-Paxos




[jira] [Commented] (CASSANDRA-5062) Support CAS

2013-02-25 Thread Cristian Opris (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-5062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13586394#comment-13586394
 ] 

Cristian Opris commented on CASSANDRA-5062:
---

So I guess what I'm proposing is similar to what Piotr said above: each CAS is 
a round of Paxos.
With some cleverness this can be collapsed to Multi-Paxos. 
 

Spinnaker does leader election with ZK precisely because they did not want to 
implement Paxos themselves. 

From the paper, section 5: The replication protocol has two phases: a leader 
election phase, followed by a quorum phase where the leader proposes a write 
and the followers accept it.

That is Multi-Paxos, with first phase (leader election) handled by ZK and 
second phase being the steady state (propose/accept) with the actual 
write/commit




[jira] [Created] (CASSANDRA-4989) Expose new SliceQueryFilter features through Thrift interface

2012-11-23 Thread Cristian Opris (JIRA)
Cristian Opris created CASSANDRA-4989:
-

 Summary: Expose new SliceQueryFilter features through Thrift 
interface
 Key: CASSANDRA-4989
 URL: https://issues.apache.org/jira/browse/CASSANDRA-4989
 Project: Cassandra
  Issue Type: Improvement
  Components: API
Affects Versions: 1.2.0, 1.2.1, 1.3
Reporter: Cristian Opris


SliceQueryFilter has some very useful new features like ability to specify a 
composite column prefix to group by and specify a limit of groups to return.

This is very useful if for example I have a wide row with columns prefixed by 
timestamp and I want to retrieve the latest columns, but I don't know the 
column names. Say I have a row
{{row - (t1, c1), (t1, c2)... (t1, cn) ... (t0,c1) ... etc}}

Query slice range (t1,) group by prefix (1) limit (1)
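
The intended semantics of that query can be sketched in plain Python. This is not the SliceQueryFilter API: `slice_grouped` and its parameters are hypothetical, and the row is assumed to be sorted newest-first by its comparator.

```python
from itertools import groupby, islice

# Columns as (timestamp, name) composites, sorted newest-first (assumption).
row = [(3, "c1"), (3, "c2"), (3, "c3"), (2, "c1"), (2, "c2"), (1, "c1")]


def slice_grouped(columns, start, prefix_len, group_limit):
    """Return columns from `start` on, keeping only the first `group_limit` prefix groups."""
    in_range = (c for c in columns if c[0] <= start)
    groups = groupby(in_range, key=lambda c: c[:prefix_len])
    # islice stops pulling from the row once enough groups have been read.
    return [c for _, group in islice(groups, group_limit) for c in group]


latest = slice_grouped(row, start=3, prefix_len=1, group_limit=1)
```

Here `latest` holds only the three columns with timestamp 3; the scan stops at the first column of the next group instead of walking the rest of the row.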

As a more general question, is the Thrift interface going to be kept up-to-date 
with the feature changes or will it be left behind (a mistake IMO) ?





[jira] [Commented] (CASSANDRA-4989) Expose new SliceQueryFilter features through Thrift interface

2012-11-23 Thread Cristian Opris (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13503241#comment-13503241
 ] 

Cristian Opris commented on CASSANDRA-4989:
---

Sorry for not being clearer. What I'd like is to run that query 
efficiently when I don't know t1 precisely: I just want to get the 
latest columns before a time T.

That can be done currently with Thrift, but it will return all columns with 
time t < T, whereas this way I can efficiently get just the latest.
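
To show the difference in result size, here is a toy comparison. Both functions are hypothetical and work on a plain Python list standing in for a wide row stored newest-first; nothing here is the actual Thrift or SliceQueryFilter API.

```python
from itertools import dropwhile, takewhile


def plain_slice_before(row, t):
    # What a plain slice gives back: every column older than t.
    return [col for col in row if col[0][0] < t]


def latest_group_before(row, t):
    # Skip columns at or after t, then keep only the columns that share
    # the newest remaining timestamp.
    older = dropwhile(lambda col: col[0][0] >= t, row)
    first = next(older, None)
    if first is None:
        return []
    latest_ts = first[0][0]
    rest = takewhile(lambda col: col[0][0] == latest_ts, older)
    return [first] + list(rest)


# Hypothetical row: ((timestamp, name), value), newest first.
row = [
    ((30, "bid"), 101),
    ((30, "ask"), 103),
    ((20, "bid"), 100),
    ((20, "ask"), 102),
    ((10, "bid"), 99),
]
everything_older = plain_slice_before(row, 25)
as_of = latest_group_before(row, 25)
```

For T = 25 the plain slice returns all three columns with timestamps 20 and 10, while the as-of variant returns only the two columns at timestamp 20.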

Note that as-of queries are very common in financial applications, 
for example, so this use case is worth considering.

I'm not sure about the handling of deleted keys, but maybe we can find a way to 
generalize and expose this? I would have asked for a feature like this anyway; 
it just so happens that, looking at the code, I see this has been done to support 
CQL limits.

Since I have an object serialization client API on top of Thrift, CQL is not 
much use to me...



[jira] [Commented] (CASSANDRA-4977) Expose new SliceQueryFilter features through Thrift interface

2012-11-20 Thread Cristian Opris (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13501124#comment-13501124
 ] 

Cristian Opris commented on CASSANDRA-4977:
---

This was posted by me, but it was apparently logged under an anonymous account at 
the time.

 Expose new SliceQueryFilter features through Thrift interface
 -

 Key: CASSANDRA-4977
 URL: https://issues.apache.org/jira/browse/CASSANDRA-4977
 Project: Cassandra
  Issue Type: Improvement
  Components: API
Affects Versions: 1.2.0 beta 2
Reporter: aaa
