[jira] [Commented] (CASSANDRA-4989) Expose new SliceQueryFilter features through Thrift interface
[ https://issues.apache.org/jira/browse/CASSANDRA-4989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13612553#comment-13612553 ] Cristian Opris commented on CASSANDRA-4989: --- Yes, Sylvain is correct. This is essentially an optimization to avoid iterating through the columns and just get the latest group that has a common prefix. I noticed this can be done with the new SliceQueryFilter so it would be useful if it can be exposed. If I'm allowed to go off on a tangent here (I know, not the best place) having more pluggable behaviour would be an interesting direction to take with Cassandra. Same way it's possible to have custom column comparators, maybe we could have pluggable row level indexes, pluggable queries to use them, pluggable notification systems, etc. I know this has been discussed before, just wanted to add my vote here. Thanks Expose new SliceQueryFilter features through Thrift interface - Key: CASSANDRA-4989 URL: https://issues.apache.org/jira/browse/CASSANDRA-4989 Project: Cassandra Issue Type: Improvement Components: API Affects Versions: 1.2.0, 1.2.1, 2.0 Reporter: Cristian Opris SliceQueryFilter has some very useful new features like ability to specify a composite column prefix to group by and specify a limit of groups to return. This is very useful if for example I have a wide row with columns prefixed by timestamp and I want to retrieve the latest columns, but I don't know the column names. Say I have a row {{row - (t1, c1), (t1, c2)... (t1, cn) ... (t0,c1) ... etc}} Query slice range (t1,) group by prefix (1) limit (1) As a more general question, is the Thrift interface going to be kept up-to-date with the feature changes or will it be left behind (a mistake IMO) ? -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
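The "group by prefix, limit groups" behaviour described in this issue can be illustrated with a small client-side sketch. This is not the SliceQueryFilter or Thrift API; the function name `latest_groups` and its parameters are hypothetical, and it assumes the columns arrive already ordered by the comparator, as in the example row `(t1, c1), (t1, c2) ... (t0, c1)`.

```python
from itertools import groupby


def latest_groups(sorted_columns, prefix_len=1, group_limit=1):
    """Return the columns of the first `group_limit` prefix groups.

    `sorted_columns` is an iterable of (composite_name, value) pairs that is
    assumed to be in comparator order already (hypothetical input shape).
    Iteration stops as soon as the next group starts, so the rest of a wide
    row is never scanned.
    """
    result = []
    grouped = groupby(sorted_columns, key=lambda col: col[0][:prefix_len])
    for index, (_prefix, group) in enumerate(grouped):
        if index >= group_limit:
            break
        result.extend(group)
    return result


# Example row: composite names are (timestamp, column), newest timestamp first.
row = [
    (("t1", "c1"), "v1"),
    (("t1", "c2"), "v2"),
    (("t0", "c1"), "v3"),
]
latest = latest_groups(row)
```

Here `latest` holds only the two `t1` columns, which corresponds to "slice range (t1,) group by prefix (1) limit (1)" in the description above.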
[jira] [Commented] (CASSANDRA-5062) Support CAS
[ https://issues.apache.org/jira/browse/CASSANDRA-5062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13593457#comment-13593457 ] Cristian Opris commented on CASSANDRA-5062: --- I understand what you mean by terminology, but this is not where the confusion is coming from. My commit (C1, C2, etc.) is your learn, agreed. My accept is your commit. It may be a bit confusing because I'm not detailing everything in the diagram. So when Z goes into C1, that implies: it receives accept from Y, it commits (i.e. writes) the value locally, and then it sends a learn message to X and Y, which might fail without Z having any record of that. I know this is not the exact behaviour in your algorithm. I'm not sure how the leader commits (learns) the value locally; is it because it ends up calling receive(LEARN) locally (i.e. acting as acceptor as well)? But this doesn't change my point. *My point is that the learn can fail without the leader being aware, which leads to a state where each replica is at a different stage of learning. Even if the paxos round states are correct in terms of accepted values (what you call commit), they are not finished in terms of learning.* Support CAS --- Key: CASSANDRA-5062 URL: https://issues.apache.org/jira/browse/CASSANDRA-5062 Project: Cassandra Issue Type: New Feature Components: API, Core Reporter: Jonathan Ellis Fix For: 2.0 Attachments: half-baked commit 1.jpg, half-baked commit 2.jpg, half-baked commit 3.jpg Strong consistency is not enough to prevent race conditions. The classic example is user account creation: we want to ensure usernames are unique, so we only want to signal account creation success if nobody else has created the account yet. But naive read-then-write allows clients to race and both think they have a green light to create. -- This message is automatically generated by JIRA. 
[jira] [Commented] (CASSANDRA-5062) Support CAS
[ https://issues.apache.org/jira/browse/CASSANDRA-5062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13592282#comment-13592282 ] Cristian Opris commented on CASSANDRA-5062: --- FWIW, just to clarify my own examples (I can't speak for Jonathan): *the version counter or most recent commit is NOT the paxos proposal number*. I've omitted the Paxos proposal number in most of my examples, except for the last, more detailed one. A timeuuid is fine for the proposal number. Also, with regard to logging versus no logging: I believe you only need to keep a log if you plan to replicate operations rather than state. Transferring state (as we have discussed so far) does not require a log but makes it impractical to replicate large values, so this is the main trade-off; I don't believe it has anything to do with paxos. Support CAS --- Key: CASSANDRA-5062 URL: https://issues.apache.org/jira/browse/CASSANDRA-5062 Project: Cassandra Issue Type: New Feature Components: API, Core Reporter: Jonathan Ellis Fix For: 2.0 Attachments: half-baked commit 1.jpg, half-baked commit 2.jpg, half-baked commit 3.jpg Strong consistency is not enough to prevent race conditions. The classic example is user account creation: we want to ensure usernames are unique, so we only want to signal account creation success if nobody else has created the account yet. But naive read-then-write allows clients to race and both think they have a green light to create.
[jira] [Commented] (CASSANDRA-5062) Support CAS
[ https://issues.apache.org/jira/browse/CASSANDRA-5062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13592559#comment-13592559 ] Cristian Opris commented on CASSANDRA-5062: --- [~slebresne] I have read your pseudo-code, seems pretty much what I was trying to describe with the version counter that counts paxos rounds (except I was thinking at row level rather than column level) I noticed however that while the leader's proposal is aborted if it has a stale round, the acceptor algorithm does not handle the case when the acceptor replica is behind. Basically in the acceptor algorithm you don't seem to handle the case where C_current.timestamp() R One way to do that is to nack the proposal indicating it needs to catch up and either expect to receive a snapshot from the leader or do a read. Also note you don't need to send the column values with proposal. If you get quorum for the proposal you can perform the CAS locally and just send the new column value. Essentially consensus is on the next column value to write, not the CAS. Since proposer is guaranteed to be up to date before sending accept, it can do the CAS locally. Support CAS --- Key: CASSANDRA-5062 URL: https://issues.apache.org/jira/browse/CASSANDRA-5062 Project: Cassandra Issue Type: New Feature Components: API, Core Reporter: Jonathan Ellis Fix For: 2.0 Attachments: half-baked commit 1.jpg, half-baked commit 2.jpg, half-baked commit 3.jpg Strong consistency is not enough to prevent race conditions. The classic example is user account creation: we want to ensure usernames are unique, so we only want to signal account creation success if nobody else has created the account yet. But naive read-then-write allows clients to race and both think they have a green light to create. -- This message is automatically generated by JIRA. 
[jira] [Comment Edited] (CASSANDRA-5062) Support CAS
[ https://issues.apache.org/jira/browse/CASSANDRA-5062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13592559#comment-13592559 ] Cristian Opris edited comment on CASSANDRA-5062 at 3/4/13 7:48 PM: --- [~slebresne] I have read your pseudo-code, seems pretty much what I was trying to describe with the version counter that counts paxos rounds (except I was thinking at row level rather than column level) I noticed however that while the leader's proposal is aborted if it has a stale round, the acceptor algorithm does not handle the case when the acceptor replica is behind. Basically in the acceptor algorithm you don't seem to handle the case where C_current.timestamp() R One way to do that is to nack the proposal indicating it needs to catch up and either expect to receive a snapshot from the leader or do a read. Also note you don't need to send the column values with the proposal. If you get quorum for the proposal you can perform the CAS locally and just send the new column value with the accept Essentially consensus is on the next column value to write, not the CAS. Since proposer is guaranteed to be up to date before sending accept, it can do the CAS locally. was (Author: onetoinfin...@yahoo.com): [~slebresne] I have read your pseudo-code, seems pretty much what I was trying to describe with the version counter that counts paxos rounds (except I was thinking at row level rather than column level) I noticed however that while the leader's proposal is aborted if it has a stale round, the acceptor algorithm does not handle the case when the acceptor replica is behind. Basically in the acceptor algorithm you don't seem to handle the case where C_current.timestamp() R One way to do that is to nack the proposal indicating it needs to catch up and either expect to receive a snapshot from the leader or do a read. Also note you don't need to send the column values with proposal. 
If you get quorum for the proposal you can perform the CAS locally and just send the new column value. Essentially consensus is on the next column value to write, not the CAS. Since proposer is guaranteed to be up to date before sending accept, it can do the CAS locally. Support CAS --- Key: CASSANDRA-5062 URL: https://issues.apache.org/jira/browse/CASSANDRA-5062 Project: Cassandra Issue Type: New Feature Components: API, Core Reporter: Jonathan Ellis Fix For: 2.0 Attachments: half-baked commit 1.jpg, half-baked commit 2.jpg, half-baked commit 3.jpg Strong consistency is not enough to prevent race conditions. The classic example is user account creation: we want to ensure usernames are unique, so we only want to signal account creation success if nobody else has created the account yet. But naive read-then-write allows clients to race and both think they have a green light to create. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Comment Edited] (CASSANDRA-5062) Support CAS
[ https://issues.apache.org/jira/browse/CASSANDRA-5062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13592559#comment-13592559 ] Cristian Opris edited comment on CASSANDRA-5062 at 3/4/13 7:52 PM: --- [~slebresne] I have read your pseudo-code, seems pretty much what I was trying to describe with the version counter that counts paxos rounds (except I was thinking at row level rather than column level) I noticed however that while the leader's proposal is aborted if it has a stale round, the acceptor algorithm does not handle the case when the acceptor replica is behind. Basically in the acceptor algorithm you don't seem to handle the case where C_current.timestamp() R-1 Edit: C_current.timestamp needs to be exactly R-1 if you increment the counter on sending the proposal. One way to do that is to nack the proposal indicating it needs to catch up and either expect to receive a snapshot from the leader or do a read. Also note you don't need to send the column values with the proposal. If you get quorum for the proposal you can perform the CAS locally and just send the new column value with the accept Essentially consensus is on the next column value to write, not the CAS. Since proposer is guaranteed to be up to date before sending accept, it can do the CAS locally. was (Author: onetoinfin...@yahoo.com): [~slebresne] I have read your pseudo-code, seems pretty much what I was trying to describe with the version counter that counts paxos rounds (except I was thinking at row level rather than column level) I noticed however that while the leader's proposal is aborted if it has a stale round, the acceptor algorithm does not handle the case when the acceptor replica is behind. Basically in the acceptor algorithm you don't seem to handle the case where C_current.timestamp() R One way to do that is to nack the proposal indicating it needs to catch up and either expect to receive a snapshot from the leader or do a read. 
Also note you don't need to send the column values with the proposal. If you get quorum for the proposal you can perform the CAS locally and just send the new column value with the accept Essentially consensus is on the next column value to write, not the CAS. Since proposer is guaranteed to be up to date before sending accept, it can do the CAS locally. Support CAS --- Key: CASSANDRA-5062 URL: https://issues.apache.org/jira/browse/CASSANDRA-5062 Project: Cassandra Issue Type: New Feature Components: API, Core Reporter: Jonathan Ellis Fix For: 2.0 Attachments: half-baked commit 1.jpg, half-baked commit 2.jpg, half-baked commit 3.jpg Strong consistency is not enough to prevent race conditions. The classic example is user account creation: we want to ensure usernames are unique, so we only want to signal account creation success if nobody else has created the account yet. But naive read-then-write allows clients to race and both think they have a green light to create. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (CASSANDRA-5062) Support CAS
[ https://issues.apache.org/jira/browse/CASSANDRA-5062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13592634#comment-13592634 ] Cristian Opris commented on CASSANDRA-5062: --- Sorry, I've probably edited my comment after your reply. C_current.timestamp needs to be exactly R-1 if you increment the counter on sending the proposal. If it's less, then the acceptor hasn't learned the previously committed value (R-1), so it can't participate in round R; otherwise we're mixing up rounds. If it's more, then the proposer is behind, so you already handle that. Regarding "If you get quorum for the proposal you can perform the CAS locally and just send the new column value with the accept": by that I meant you can do the validate part of the CAS locally, not actually write the CAS. Basically any operation (not just CAS) can be evaluated (in memory) by the proposer after it gets quorum for the proposal (which guarantees it has the latest *committed* value), so it obtains the value to send for acceptance. This is more of an optimization where you exchange and agree on values rather than operations (state transfer replication). It also solves the problem of where to validate the CAS. Support CAS --- Key: CASSANDRA-5062 URL: https://issues.apache.org/jira/browse/CASSANDRA-5062 Project: Cassandra Issue Type: New Feature Components: API, Core Reporter: Jonathan Ellis Fix For: 2.0 Attachments: half-baked commit 1.jpg, half-baked commit 2.jpg, half-baked commit 3.jpg Strong consistency is not enough to prevent race conditions. The classic example is user account creation: we want to ensure usernames are unique, so we only want to signal account creation success if nobody else has created the account yet. But naive read-then-write allows clients to race and both think they have a green light to create. -- This message is automatically generated by JIRA. 
[jira] [Comment Edited] (CASSANDRA-5062) Support CAS
[ https://issues.apache.org/jira/browse/CASSANDRA-5062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13592634#comment-13592634 ] Cristian Opris edited comment on CASSANDRA-5062 at 3/4/13 9:05 PM: --- Sorry, I've probably edited my comment after your reply. C_current.timestamp needs to be exactly R-1 if you increment the counter on sending the proposal. If it's less, then the acceptor hasn't learned the previously committed value (R-1) so can't participate in round R, otherwise we're mixing up rounds. If it's more, then the proposer is behind so you already handle that. Regarding If you get quorum for the proposal you can perform the CAS locally and just send the new column value with the accept By that I meant you can do the validate part of the CAS locally, not actually write the CAS. Basically any operation (not just CAS) can be evaluated (in memory) by the proposal after it gets quorum for the proposal (which guarantees it has the latest *committed* value) so it obtains the value to send for acceptance. This is more of an optimization where you exchange and agree on values rather than operations (state transfer replication). Also solves the problem of where to validate the CAS. was (Author: onetoinfin...@yahoo.com): Sorry, I've probably edited my comment after your reply. C_current.timestamp needs to be exactly R-1 if you increment the counter on sending the proposal. If it's less than the acceptor hasn't learned the previously committed value (R-1) so can't participate in round R, otherwise we're mixing up rounds. If it's more, than the proposer is behind so you already handle that. Regarding If you get quorum for the proposal you can perform the CAS locally and just send the new column value with the accept By that I meant you can do the validate part of the CAS locally, not actually write the CAS. 
Basically any operation (not just CAS) can be evaluated (in memory) by the proposal after it gets quorum for the proposal (which guarantees it has the latest *committed* value) so it obtains the value to send for acceptance. This is more of an optimization where you exchange and agree on values rather than operations (state transfer replication). Also solves the problem of where to validate the CAS. Support CAS --- Key: CASSANDRA-5062 URL: https://issues.apache.org/jira/browse/CASSANDRA-5062 Project: Cassandra Issue Type: New Feature Components: API, Core Reporter: Jonathan Ellis Fix For: 2.0 Attachments: half-baked commit 1.jpg, half-baked commit 2.jpg, half-baked commit 3.jpg Strong consistency is not enough to prevent race conditions. The classic example is user account creation: we want to ensure usernames are unique, so we only want to signal account creation success if nobody else has created the account yet. But naive read-then-write allows clients to race and both think they have a green light to create. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
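The two points in this comment (the acceptor only joins round R when it has committed exactly R-1, and the proposer validates the CAS in memory before sending a plain value for acceptance) might be sketched as follows. The class `Acceptor`, the reply strings, and `value_to_accept` are hypothetical names for illustration and are not taken from the patch under discussion.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Acceptor:
    # Last round this replica has learned (committed) locally.
    committed_round: int = 0

    def on_propose(self, round_no: int) -> str:
        """Reply to a proposal for `round_no`.

        The replica may only take part if it has committed exactly
        round_no - 1; the reply strings are illustrative only.
        """
        if self.committed_round < round_no - 1:
            return "nack-acceptor-behind"  # acceptor must catch up first
        if self.committed_round > round_no - 1:
            return "nack-proposer-behind"  # proposer has a stale round
        return "promise"


def value_to_accept(current: Optional[str], expected: Optional[str],
                    new: str) -> Optional[str]:
    """Validate the CAS in memory on the proposer.

    Returns the value to send with the accept, or None if the CAS
    condition does not hold and the round should be abandoned.
    """
    return new if current == expected else None


acceptor = Acceptor(committed_round=4)
replies = [acceptor.on_propose(r) for r in (4, 5, 6)]
```

With this shape, consensus is reached on the next value to write rather than on the CAS operation itself, which matches the "state transfer" framing above.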
[jira] [Commented] (CASSANDRA-5062) Support CAS
[ https://issues.apache.org/jira/browse/CASSANDRA-5062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13592743#comment-13592743 ] Cristian Opris commented on CASSANDRA-5062: --- Say you have this: Proposer has committed R-1, starts round R, proposal timestamp Tn Acceptor recovers with committed R-n R-1, and has accepted value A at R-n+1 R-1 at Tm in the paxos state log. When Acceptor receives proposal, if it doesn't check R, if Tm Tn (clock mismatch) according to paxos it needs to send it's old accepted value and the proposer will have to use it to commit. It will end up committing an old value. It's an edge case but not impossible. Paxos holds within the same round, but not across rounds. This makes sense because a Paxos round just means agree on a value which once accepted by a quorum can never change. Which is why you can't have an out of date replica participate in a round. The idea is to move from quorum that committed (learned) R to quorum that accepts R+1 to quorum that commits R+1 and so on. Note the quorums don't need to be made of same components. To ensure this you maintain the invariant that *you can't propose or accept R+1 locally if you haven't committed R* So a replica can die and recover, but to recover and participate in paxos needs to learn the latest value. This also gives you consistent read (at the possible cost of an extra read paxos proposal to ensure that the last paxos round is committed if left ambiguous) Support CAS --- Key: CASSANDRA-5062 URL: https://issues.apache.org/jira/browse/CASSANDRA-5062 Project: Cassandra Issue Type: New Feature Components: API, Core Reporter: Jonathan Ellis Fix For: 2.0 Attachments: half-baked commit 1.jpg, half-baked commit 2.jpg, half-baked commit 3.jpg Strong consistency is not enough to prevent race conditions. 
The classic example is user account creation: we want to ensure usernames are unique, so we only want to signal account creation success if nobody else has created the account yet. But naive read-then-write allows clients to race and both think they have a green light to create. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (CASSANDRA-5062) Support CAS
[ https://issues.apache.org/jira/browse/CASSANDRA-5062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13592748#comment-13592748 ] Cristian Opris commented on CASSANDRA-5062: --- So I think what you're doing at the moment is effectively using the (R,P) tuple as the proposal number within a single continuous Paxos instance, that sometimes may agree on things and sometimes replicas learn the agreed value. Support CAS --- Key: CASSANDRA-5062 URL: https://issues.apache.org/jira/browse/CASSANDRA-5062 Project: Cassandra Issue Type: New Feature Components: API, Core Reporter: Jonathan Ellis Fix For: 2.0 Attachments: half-baked commit 1.jpg, half-baked commit 2.jpg, half-baked commit 3.jpg Strong consistency is not enough to prevent race conditions. The classic example is user account creation: we want to ensure usernames are unique, so we only want to signal account creation success if nobody else has created the account yet. But naive read-then-write allows clients to race and both think they have a green light to create. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (CASSANDRA-5062) Support CAS
[ https://issues.apache.org/jira/browse/CASSANDRA-5062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13591778#comment-13591778 ] Cristian Opris commented on CASSANDRA-5062: --- Jonathan, even if you could rely on monotonically increasing timestamps (which is a big assumption), I don't think this will work because it does not clearly demarcate between paxos rounds. So you could have a scenario where you end up with different values committed at each replica:
{code}
     R1    R2    R3
1.   C0    C0    C0    //initial state ts=0
2.         P1    P1    //R3 initiates proposal ts=1
3.         A1    A1    //accept ts=1
4.               C1    //R2 has majority, commits ts=1
5.   P2    P2          //R1 initiates proposal ts=2
6.   A2    A2          //accept ts=2; note this breaks Paxos since R1 should have chosen A1
7.   C2                //R1 commits C2
{code}
After step 7, R1=C2, R2=C0, R3=C1. If a read comes in at this point, what would it resolve to? You could say use the highest timestamp, but that would require a read ALL. More importantly, if a CAS request comes in, the validation of that depends on which replica it executes (unless again we do a read ALL before). The reason I suggested version counters is because this allows a replica to detect it has missed paxos rounds and needs to sync up before proceeding. The example above modified:
{code}
     R1         R2       R3
1.   C0         C0       C0    //initial state v=0
2.              P1       P1    //R3 initiates proposal v=1
3.              A1       A1    //accept ts=1
4.                       C1    //R2 has majority, commits ts=1
5a.  P1'                       //R1 wants to initiate its own P1
5b.  nack P1'                  //but rejected since already committed
5c.  read C1                   //read and commit C1 (finish round 1)
5d.  P2                        //restarts proposal with v=2
5e.             P2 C1          //R2 receives P2 and notices it's missing C1 which it needs to commit first
6.   A2         A2             //accept v=2; this is ok for Paxos as it's truly a new round
7.   C2                        //R1 commits C2
{code}
After step 7, R1=(C2,A2) R2=(C1,A2) R3=(C1,A1). The most ambiguous quorum is R2,R3. Let's even assume that R1 has failed. 
The ambiguity can still be solved by initiating a new paxos round at version v=2 which will necessarily accept and commit A2. (this follows from Paxos) So to have a consistent read, the read might perform a paxos round to commit A2. This is a sketch of a proof this is correct: - if no replica can participate in a paxos round for version V, as acceptor or proposer, until it learns and commits locally the previous version V-1 - then for Paxos to achieve a quorum of accept at V, a quorum of replicas must have committed V-1 - once a quorum has accepted the same value for V, all replicas can eventually learn and commit V by simply rerunning a paxos round at V with value Nil (this can be triggered by an attempt to write V+1, or a read as shown above) Support CAS --- Key: CASSANDRA-5062 URL: https://issues.apache.org/jira/browse/CASSANDRA-5062 Project: Cassandra Issue Type: New Feature Components: API, Core Reporter: Jonathan Ellis Fix For: 2.0 Attachments: half-baked commit 1.jpg, half-baked commit 2.jpg, half-baked commit 3.jpg Strong consistency is not enough to prevent race conditions. The classic example is user account creation: we want to ensure usernames are unique, so we only want to signal account creation success if nobody else has created the account yet. But naive read-then-write allows clients to race and both think they have a green light to create. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
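The version-counter rule in this comment (a replica takes part in version V only after committing V-1) can be sketched as a tiny state machine. This is a minimal illustration under assumed names: `Replica`, `on_propose`, `learn`, and the reply tags are hypothetical, and the sketch leaves out proposal numbers, accepts, and networking.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    name: str
    version: int = 0      # last committed version
    value: str = "C0"     # value committed at that version

    def learn(self, version: int, value: str) -> None:
        # Only the next version in sequence may be committed locally.
        if version == self.version + 1:
            self.version, self.value = version, value

    def on_propose(self, version: int):
        """Reply to a proposal for `version` (reply tags are illustrative)."""
        if version <= self.version:
            # Proposer is behind: reject and hand back what was committed.
            return ("nack", (self.version, self.value))
        if version > self.version + 1:
            # This replica missed a round and must catch up first.
            return ("behind", None)
        return ("promise", None)


# Steps 5a-5e of the modified example: R3 has committed C1, R1 and R2 have not.
r1, r2, r3 = Replica("R1"), Replica("R2"), Replica("R3", 1, "C1")

first_reply = r3.on_propose(1)          # R1 proposes its own version 1
r1.learn(*first_reply[1])               # R1 reads and commits C1
second_reply = r2.on_propose(2)         # R1 restarts at version 2
r2.learn(1, "C1")                       # R2 commits the missing C1
third_reply = r2.on_propose(2)          # R2 can now promise
```

The gap check in `on_propose` is what lets a replica notice it has missed a round, which timestamps alone do not reveal.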
[jira] [Commented] (CASSANDRA-5062) Support CAS
[ https://issues.apache.org/jira/browse/CASSANDRA-5062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13591811#comment-13591811 ] Cristian Opris commented on CASSANDRA-5062: --- Ok, but I think my point was that even if you can assume that (monotonic time), then it's still not correct, because when a proposal with a new value and a higher timestamp than the last committed one comes in, accepting it over a previously accepted value would violate paxos. That is step 6 in my first example there. This at least breaks CAS and cannot give a consistent read. However, I confess I don't fully understand your solution; could you summarize or formalize it a bit? Support CAS --- Key: CASSANDRA-5062 URL: https://issues.apache.org/jira/browse/CASSANDRA-5062 Project: Cassandra Issue Type: New Feature Components: API, Core Reporter: Jonathan Ellis Fix For: 2.0 Attachments: half-baked commit 1.jpg, half-baked commit 2.jpg, half-baked commit 3.jpg Strong consistency is not enough to prevent race conditions. The classic example is user account creation: we want to ensure usernames are unique, so we only want to signal account creation success if nobody else has created the account yet. But naive read-then-write allows clients to race and both think they have a green light to create.
[jira] [Commented] (CASSANDRA-5062) Support CAS
[ https://issues.apache.org/jira/browse/CASSANDRA-5062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13591828#comment-13591828 ] Cristian Opris commented on CASSANDRA-5062: --- One more example of how mostRecentCommit is ambiguous:
{code}
R1                R2      R3
Ct0               Ct0     Ct0             //initial state at t0
                  Atn     Atn             //accept at Tn > t0
                          Atn -> Ctn      //R3 commits Ctn, mostRecentCommit = tn, Accept is cleared !
Atn+m                     Atn+m           //R3 accepts new value at tn+m > tn, this is valid since accept has been cleared
Atn+m -> Ctn+m                            //ambiguous state with R1=Ctn+m, R2=Ct0, R3=Ctn, needs read ALL to resolve
{code}
Support CAS --- Key: CASSANDRA-5062 URL: https://issues.apache.org/jira/browse/CASSANDRA-5062 Project: Cassandra Issue Type: New Feature Components: API, Core Reporter: Jonathan Ellis Fix For: 2.0 Attachments: half-baked commit 1.jpg, half-baked commit 2.jpg, half-baked commit 3.jpg Strong consistency is not enough to prevent race conditions. The classic example is user account creation: we want to ensure usernames are unique, so we only want to signal account creation success if nobody else has created the account yet. But naive read-then-write allows clients to race and both think they have a green light to create.
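The ambiguity in the final state of this example (R1=Ctn+m, R2=Ct0, R3=Ctn) can be checked with a short sketch that resolves every two-replica quorum read by highest commit timestamp. The numeric timestamps 0, 20 and 30 stand in for t0, tn and tn+m and are assumptions for illustration, as are the function names.

```python
from itertools import combinations


def resolve_by_timestamp(replies):
    """Pick the value with the highest commit timestamp among the replies."""
    return max(replies, key=lambda reply: reply[0])[1]


def quorum_results(state, quorum_size=2):
    """Map each possible quorum to the value a read of that quorum resolves to."""
    results = {}
    for quorum in combinations(sorted(state), quorum_size):
        results[quorum] = resolve_by_timestamp([state[name] for name in quorum])
    return results


# Final state of the example: (commit timestamp, committed value) per replica.
state = {"R1": (30, "Ctn+m"), "R2": (0, "Ct0"), "R3": (20, "Ctn")}
results = quorum_results(state)
```

The quorum (R2, R3) resolves to Ctn while any quorum containing R1 resolves to Ctn+m, so only a read at ALL gives a single answer in this state.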
[jira] [Commented] (CASSANDRA-5062) Support CAS
[ https://issues.apache.org/jira/browse/CASSANDRA-5062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13591890#comment-13591890 ] Cristian Opris commented on CASSANDRA-5062: --- OK, I believe what you're proposing is very close to what I am thinking. Essentially you're using the mostRecentCommit timestamp (mrc) to track the paxos instance, while I am proposing to use a sequence value that is incremented on local commit. I expect that in your case as well this epoch number, let's call it, is different from the proposal number, which can indeed be a timestamp (timeuuid). It seems this epoch doesn't have to be sequential, so a timestamp could work. (I would still go with a sequence just not to depend on the clock at all, but it's not necessary.) I reworked the example above with more detail, and it seems correct:
{code}
R1            R2                  R3
Ct0           Ct0                 Ct0               //initial state at t0
              Ptn(epoch=t0) <-                      //R3 makes a proposal numbered tn with mRC=t0
              promise(Ptn) -->                      //R2 promises
              Atn                 Atn               //accept at Tn > t0
                                  Atn -> Ctn        //R3 commits Ctn, mrc=tn, accept is cleared
--->                              Ptn+m(mrc=t0)     //R1 makes a proposal tn+m with mRC=t0, last it knows of
<---                              nack (Ctn)        //R3 rejects since stale mRC; send Ctn directly for R1 to learn
Ctn --->                          Ptn+m(mrc=tn)     //propose again at mRC=tn
<- ok                                               //R3 promises
Atn+m                             Atn+m             //R3 accepts new value at tn+m > tn, this is now valid
Ctn+m
{code}
State: R1=(Ctn+m), R2=(Ct0,Atn), R3=(Ctn,Atn+m). Now I think this is pretty much like the variant with the version counter above. To do a consistent read, the read may have to perform the completion of the paxos round for Atn+m, but it's guaranteed to resolve to Ctn+m whatever quorum it reads. Support CAS --- Key: CASSANDRA-5062 URL: https://issues.apache.org/jira/browse/CASSANDRA-5062 Project: Cassandra Issue Type: New Feature Components: API, Core Reporter: Jonathan Ellis Fix For: 2.0 Attachments: half-baked commit 1.jpg, half-baked commit 2.jpg, half-baked commit 3.jpg Strong consistency is not enough to prevent race conditions. 
The classic example is user account creation: we want to ensure usernames are unique, so we only want to signal account creation success if nobody else has created the account yet. But naive read-then-write allows clients to race and both think they have a green light to create. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Comment Edited] (CASSANDRA-5062) Support CAS
[ https://issues.apache.org/jira/browse/CASSANDRA-5062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13591890#comment-13591890 ]

Cristian Opris edited comment on CASSANDRA-5062 at 3/3/13 11:09 PM:

OK, I believe what you're proposing is very close to what I am thinking. Essentially you're using the mostRecentCommit timestamp (mrc) to track the Paxos instance, while I am proposing to use a sequence value that is incremented on local commit.

I expect that in your case as well this epoch number (let's call it that) is different from the proposal number, which can indeed be a timestamp (timeuuid). It seems this epoch doesn't have to be sequential, so a timestamp could work. (I would still go with a sequence, just so as not to depend on the clock at all, but it's not necessary.)

I reworked the example above with more detail, and it seems correct:

{code}
R1 R2 R3
Ct0 Ct0 Ct0 //initial state at t0
Ptn(mrc=t0) - //R3 makes a proposal numbered tn with most recent committed t0
-- ok -- //R2 promises
Atn Atn //accept at Tn
t0 Atn - Ctn //R3 commits Ctn, mrc=tn, accept is cleared
--- Ptn+m(mrc=t0) //R1 makes a proposal tn+m with mRC=t0, last it knows of
--- nack (Ctn) //R3 rejects since stale mRC; send Ctn directly for R1 to learn
Ctn
--- Ptn+m(mrc=tn) //propose again at mrc=tn
- ok //R3 promises since mrc up to date
Atn+m Atn+m //R3 accepts new value at tn+m > tn
Ctn+m
{code}

State: R1=(Ctn+m), R2=(Ct0,Atn), R3=(Ctn,Atn+m)

Now I think this is pretty much like the variant with the version counter above. To do a consistent read, the read may have to complete the Paxos round for Atn+m, but it's guaranteed to resolve to Ctn+m whatever quorum it reads.
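The "nack on stale mrc" step in the example above could be sketched roughly as follows. This is an illustration under stated assumptions, not the design that was implemented: the `Replica` class, its fields, and the integer timestamps standing in for t0, tn and tn+m are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Replica:
    """Per-row Paxos state kept by one replica (hypothetical sketch)."""
    mrc: int = 0                                  # timestamp of most recent commit
    committed: Optional[str] = None               # most recently committed value
    promised: int = 0                             # highest ballot promised so far
    accepted: Optional[Tuple[int, str]] = None    # (ballot, value) accepted, if any

    def prepare(self, ballot, proposer_mrc):
        # A proposer that has not seen our latest commit is told to catch up:
        # the nack carries the commit so it can learn it directly.
        if proposer_mrc < self.mrc:
            return ("nack", self.mrc, self.committed)
        if ballot <= self.promised:
            return ("reject", self.promised, None)
        self.promised = ballot
        return ("ok", self.mrc, self.accepted)

    def accept(self, ballot, value):
        if ballot < self.promised:
            return False
        self.promised = ballot
        self.accepted = (ballot, value)
        return True

    def commit(self, ballot, value):
        if ballot > self.mrc:
            self.mrc = ballot
            self.committed = value
            # The accepted slot is cleared once its round is committed.
            if self.accepted is not None and self.accepted[0] <= ballot:
                self.accepted = None


def replay():
    """Replay R1 proposing at tn+m=15 while R3 already committed at tn=10."""
    r1, r3 = Replica(), Replica()
    r3.commit(10, "v1")
    first = r3.prepare(15, r1.mrc)      # stale mrc, so R3 nacks with its commit
    r1.commit(first[1], first[2])       # R1 learns Ctn
    second = r3.prepare(15, r1.mrc)     # mrc now up to date, so R3 promises
    return first, second
```

Here `replay()` walks through only the last few rows of the diagram; quorum handling and the read path are left out.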
[jira] [Commented] (CASSANDRA-5062) Support CAS
[ https://issues.apache.org/jira/browse/CASSANDRA-5062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13591568#comment-13591568 ]

Cristian Opris commented on CASSANDRA-5062:

[~luiscarneiro] What may happen with this is that you read a value from the most advanced replica and then try a CAS at a stale replica, which will deny it even if it's legitimate, because the expected value does not match its stale value.

I think something like this may work: you track a version counter for each row, and you make sure you advance Paxos rounds (and the version counter) one at a time per quorum.

Basically the invariant is that a replica initiates or participates in Paxos round V only after it has committed V-1 locally, which can happen when:
- it learns a *majority* has *accepted* a value at V-1, so it can *commit* V-1 locally (i.e. Paxos round V-1 is settled)
- it learns that *any* replica has *committed* V-1

I am still fuzzy on how exactly this can be accomplished, but the invariants seem good.
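The invariant above can be written down as a small state machine. This is a sketch only, assuming hypothetical names (`RowState`, `VersionedReplica`, `learn_accept`, `learn_commit`); it tracks version numbers but not the values themselves, and it does not say how the learning messages are delivered.

```python
from dataclasses import dataclass, field


@dataclass
class RowState:
    committed_version: int = 0                          # last version committed locally
    accepted_by: dict = field(default_factory=dict)     # version -> replica ids that accepted


class VersionedReplica:
    """Tracks, per row, whether this replica may join Paxos round V (hypothetical)."""

    def __init__(self, cluster_size):
        self.cluster_size = cluster_size
        self.rows = {}

    def _row(self, key):
        return self.rows.setdefault(key, RowState())

    def can_participate(self, key, version):
        # Invariant: join round V only after V-1 has been committed locally.
        return self._row(key).committed_version == version - 1

    def learn_accept(self, key, version, replica_id):
        """First condition: a majority has accepted a value at this version."""
        row = self._row(key)
        if version != row.committed_version + 1:
            return
        voters = row.accepted_by.setdefault(version, set())
        voters.add(replica_id)
        if len(voters) > self.cluster_size // 2:
            row.committed_version = version
            row.accepted_by.pop(version, None)

    def learn_commit(self, key, version):
        """Second condition: any replica reports it has committed this version."""
        row = self._row(key)
        if version > row.committed_version:
            row.committed_version = version
```

With three replicas, two `learn_accept` calls for version 1 are enough to unlock round 2, while a single `learn_commit` from any replica has the same effect.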
[jira] [Commented] (CASSANDRA-5062) Support CAS
[ https://issues.apache.org/jira/browse/CASSANDRA-5062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13591570#comment-13591570 ]

Cristian Opris commented on CASSANDRA-5062:

Note that the version counter is per row, and this would only require keeping the last committed and the last accepted values for each row (no log necessary)
[jira] [Commented] (CASSANDRA-5062) Support CAS
[ https://issues.apache.org/jira/browse/CASSANDRA-5062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13590705#comment-13590705 ]

Cristian Opris commented on CASSANDRA-5062:

There is this paper that might be of interest, Consensus on Transaction Commit: http://research.microsoft.com/apps/pubs/default.aspx?id=64636

I haven't studied it in detail yet, but it may give some ideas.

Paxos Made Live seems centered on the idea of having a replicated log. I'm not sure this applies to what we want to do. There are, however, details on how to do that in the papers it cites; the most relevant, I think, are:
- Lampson, B. W. How to build a highly available system using consensus.
- Schneider, F. B. Implementing fault-tolerant services using the state machine approach: A tutorial.

Google has links to the papers.
[jira] [Commented] (CASSANDRA-5062) Support CAS
[ https://issues.apache.org/jira/browse/CASSANDRA-5062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13589753#comment-13589753 ]

Cristian Opris commented on CASSANDRA-5062:

This WILL be exposed to Thrift as well, right ?
[jira] [Issue Comment Deleted] (CASSANDRA-5062) Support CAS
[ https://issues.apache.org/jira/browse/CASSANDRA-5062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cristian Opris updated CASSANDRA-5062:

Comment: was deleted (was: This WILL be exposed to Thrift as well, right ?)
[jira] [Commented] (CASSANDRA-5062) Support CAS
[ https://issues.apache.org/jira/browse/CASSANDRA-5062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13589771#comment-13589771 ]

Cristian Opris commented on CASSANDRA-5062:

Jonathan, how do you plan to fit CAS into Paxos? Paxos would give consensus, but what would the consensus be on? The value to write? Is the CAS run at each replica or just at the proposer? How do you make sure that when you run CAS locally you have actually learned the *previous* consensus value (to compare the expected value with)?
[jira] [Commented] (CASSANDRA-5062) Support CAS
[ https://issues.apache.org/jira/browse/CASSANDRA-5062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13589775#comment-13589775 ]

Cristian Opris commented on CASSANDRA-5062:

I believe UPDATE statements in SQL return the number of rows affected. You could do the same here (for however you define row in CQL)
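For comparison, this is how the "rows affected" convention looks in an ordinary SQL database, using Python's standard `sqlite3` module. The `accounts` table and the `cas_update` helper are made up for the illustration; nothing here is CQL or a Cassandra API.

```python
import sqlite3


def cas_update(conn, name, expected, new):
    """Conditional UPDATE; the affected-row count says whether the CAS applied."""
    cur = conn.execute(
        "UPDATE accounts SET email = ? WHERE name = ? AND email = ?",
        (new, name, expected),
    )
    conn.commit()
    return cur.rowcount


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, email TEXT)")
    conn.execute("INSERT INTO accounts VALUES ('alice', 'old@example.com')")
    # The first CAS matches the expected value; the second one no longer does.
    first = cas_update(conn, "alice", "old@example.com", "new@example.com")
    second = cas_update(conn, "alice", "old@example.com", "other@example.com")
    conn.close()
    return first, second
```

A count of 1 means the expected value matched and the write was applied; a count of 0 means the condition failed and nothing changed.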
[jira] [Commented] (CASSANDRA-5062) Support CAS
[ https://issues.apache.org/jira/browse/CASSANDRA-5062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13588316#comment-13588316 ]

Cristian Opris commented on CASSANDRA-5062:

Zab is not Paxos; it just vaguely resembles it.

The Zab leader replicates a totally ordered log of idempotent operations to ALL followers. It requires a quorum of followers to acknowledge the write before committing on the leader, and then commits on the followers.

When the leader fails, the new leader is the one that is most up to date with the writes (highest log sequence number), so that one will necessarily have all the committed writes. (If it does not have the commit for a particular write I believe it can assume it's been committed; I'm a bit unclear on this point.)

The new leader needs to fully synchronize all the replicas and establish a quorum before writes can resume. That may introduce a small period of unavailability.

At least in ZK, I believe clients connect to a single replica and may be behind the leader with reads, but they will always see all the writes (including their own, since they're forwarded to the leader and replicated back) in a consistent order.
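The recovery rule described above (the replica with the highest log sequence number becomes leader, then brings the others up to date) can be illustrated with a toy model. This is a heavily simplified sketch and not Zab itself: the function names are hypothetical, logs are plain Python lists, and real Zab also uses epochs during election and recovery.

```python
def elect_leader(logs):
    """Pick the replica with the longest log; ties go to the smallest id."""
    return max(sorted(logs), key=lambda replica: len(logs[replica]))


def synchronize(logs, leader):
    """Copy the leader's missing suffix to every follower, preserving order."""
    leader_log = logs[leader]
    for replica, log in logs.items():
        if replica != leader:
            log.extend(leader_log[len(log):])
    return logs


def recover(logs):
    leader = elect_leader(logs)
    return leader, synchronize(logs, leader)
```

For example, with logs `{"F1": ["a"], "F2": ["a", "b", "c"], "F3": ["a", "b"]}`, `recover` elects F2 and leaves all three replicas with the log `["a", "b", "c"]`. The sketch assumes every follower log is a prefix of the leader's log, which is the property the total ordering is meant to provide.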
[jira] [Commented] (CASSANDRA-5062) Support CAS
[ https://issues.apache.org/jira/browse/CASSANDRA-5062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13588328#comment-13588328 ]

Cristian Opris commented on CASSANDRA-5062:

On the other hand, Paxos for each CAS would be quite different.

The basic approach would be to have each CAS be a full Paxos round (Phase 1: prepare/promise, Phase 2: propose/accept). In this case each round is independent and writes can happen concurrently (as opposed to Zab, where all writes are applied serially cluster-wide). There doesn't even need to be a leader; that is an optimisation to ensure liveness (avoid duelling proposers).

Now, since full Paxos is quite expensive in terms of roundtrips, there are optimisations to reduce that (see Fast Paxos in the Wikipedia article), but I have yet to study the details of that.

There is also the question of how the actual CAS op would be integrated with Paxos (who does the CAS? Presumably the proposer needs to be able to do the CAS verification locally, or maybe acceptors can NACK if the CAS is rejected locally? Would that be a valid nack in Paxos terms?), but that can be sorted out.
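The two phases mentioned above can be sketched as a single-decree Paxos round with synchronous, in-process acceptors. This is a textbook-style illustration under simplifying assumptions (no networking, no failures, integer ballots); the class and function names are hypothetical, and the open question of where the CAS comparison itself runs is deliberately left out.

```python
class Acceptor:
    def __init__(self):
        self.promised = 0        # highest ballot promised
        self.accepted = None     # (ballot, value) accepted so far, if any

    def prepare(self, ballot):
        """Phase 1: promise to ignore lower ballots, report any accepted value."""
        if ballot > self.promised:
            self.promised = ballot
            return True, self.accepted
        return False, None

    def accept(self, ballot, value):
        """Phase 2: accept unless a higher ballot has been promised meanwhile."""
        if ballot >= self.promised:
            self.promised = ballot
            self.accepted = (ballot, value)
            return True
        return False


def run_round(acceptors, ballot, value):
    """Run one full round; return the chosen value, or None if the round fails."""
    majority = len(acceptors) // 2 + 1

    # Phase 1: prepare/promise
    replies = [a.prepare(ballot) for a in acceptors]
    promises = [accepted for ok, accepted in replies if ok]
    if len(promises) < majority:
        return None

    # If any acceptor already accepted a value, the proposer must adopt the
    # one with the highest ballot instead of its own.
    prior = [accepted for accepted in promises if accepted is not None]
    if prior:
        value = max(prior, key=lambda bv: bv[0])[1]

    # Phase 2: propose/accept
    acks = sum(a.accept(ballot, value) for a in acceptors)
    return value if acks >= majority else None
```

Running `run_round` on three fresh acceptors with ballot 1 and value "x" chooses "x"; a later round with ballot 2 and value "y" still returns "x", and a retry with the stale ballot 1 fails. Each round costs two roundtrips here, which is the expense the comment refers to.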
[jira] [Commented] (CASSANDRA-5062) Support CAS
[ https://issues.apache.org/jira/browse/CASSANDRA-5062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13588332#comment-13588332 ]

Cristian Opris commented on CASSANDRA-5062:

Re. which storage to use for metadata, why not use a meta-column family, like for secondary indexes, or like the locks would have required ? For Zab a persistent log will be necessary, and for Paxos a way to persist the paxos round state for each row.
[jira] [Commented] (CASSANDRA-5062) Support CAS
[ https://issues.apache.org/jira/browse/CASSANDRA-5062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13588521#comment-13588521 ]

Cristian Opris commented on CASSANDRA-5062:

The Zab paper: research.yahoo.com/files/ladis08.pdf
[jira] [Comment Edited] (CASSANDRA-5062) Support CAS
[ https://issues.apache.org/jira/browse/CASSANDRA-5062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13588521#comment-13588521 ]

Cristian Opris edited comment on CASSANDRA-5062 at 2/27/13 5:12 PM:

The Zab paper: http://research.yahoo.com/files/ladis08.pdf
[jira] [Commented] (CASSANDRA-5062) Support CAS
[ https://issues.apache.org/jira/browse/CASSANDRA-5062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13588542#comment-13588542 ]

Cristian Opris commented on CASSANDRA-5062:

In section 4.1 of the Zab paper it says:

??We are able to simplify the two-phase commit protocol because we do not have aborts; followers either acknowledge the leader’s proposal or they abandon the leader. *The lack of aborts also mean that we can commit once a quorum of servers ack the proposal rather than waiting for all servers to respond.* This simplified two-phase commit by itself cannot handle leader failures, so we will add recovery mode to handle leader failures.??

So basically once a proposal is acked by a quorum there is no going back (no abort). The leader has to succeed in committing it or else it will lose its leadership.

If the client times out in the meantime, it has to retry and find out what the result was. Presumably this can happen with regular ACID databases as well, where a client sends COMMIT TX and times out immediately after that.
[jira] [Comment Edited] (CASSANDRA-5062) Support CAS
[ https://issues.apache.org/jira/browse/CASSANDRA-5062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13588542#comment-13588542 ]

Cristian Opris edited comment on CASSANDRA-5062 at 2/27/13 5:32 PM:

In section 4.1 of the Zab paper it says:

??We are able to simplify the two-phase commit protocol because we do not have aborts; followers either acknowledge the leader’s proposal or they abandon the leader. *The lack of aborts also mean that we can commit once a quorum of servers ack the proposal rather than waiting for all servers to respond.* This simplified two-phase commit by itself cannot handle leader failures, so we will add recovery mode to handle leader failures.??

So basically once a proposal is acked by a quorum there is no going back (no abort). The leader has to succeed in committing it or else it will lose its leadership.

If the client times out in the meantime, it has to retry and find out what the result was. Presumably this can happen with regular ACID databases as well, where a client sends COMMIT TX and times out immediately after that.
[jira] [Commented] (CASSANDRA-5062) Support CAS
[ https://issues.apache.org/jira/browse/CASSANDRA-5062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13588567#comment-13588567 ]

Cristian Opris commented on CASSANDRA-5062:

Note that a proposal may eventually succeed on recovery even if less than a quorum has managed to ack it before the leader fails (and the client has timed out).

The need for quorum writes is to be able to survive F failures out of 2F+1 replicas. Reads are not quorum reads, just replica-local reads.

Let's say we have 5 replicas with F1 as leader; F4 and F5 are ignored here as they don't matter.

{{
1a. F1 - proposal - F2
1b. F1 - ack - F2
2a. F1 - proposal - F3
2b. F1 - ack - F3
3a F1 - OK - client
3b F1 - COMMIT - F2,F3
}}

If F1 fails immediately after step 1b, F2 would become the leader since it has the latest seq number. Now only F2 has the proposal, but it can continue and commit it to the other followers. If it can't get a quorum (maybe it's partitioned in a minority) then it gives up leadership. When it rejoins the majority, it runs another recovery procedure that uses epoch numbers to determine whether it needs to throw away that proposal. This is fine, since no client has actually received confirmation that the proposal was committed.

This is detailed in the paper.
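The "F failures out of 2F+1 replicas" arithmetic can be checked directly. A small sketch with made-up helper names: a majority quorum of n replicas has n // 2 + 1 members, so any two quorums share at least one replica, which is what lets a new leader find every acknowledged proposal.

```python
from itertools import combinations


def quorum_size(n):
    """Smallest majority of n replicas."""
    return n // 2 + 1


def tolerated_failures(n):
    """Largest F such that a quorum is still reachable with F replicas down."""
    return (n - 1) // 2


def quorums_always_intersect(n):
    """Brute-force check that every pair of majority quorums overlaps."""
    q = quorum_size(n)
    replicas = range(n)
    return all(
        set(a) & set(b)
        for a in combinations(replicas, q)
        for b in combinations(replicas, q)
    )
```

For the five-replica example, `quorum_size(5)` is 3 (F1, F2 and F3 in the steps above) and `tolerated_failures(5)` is 2, matching 2F+1 with F=2.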
[jira] [Comment Edited] (CASSANDRA-5062) Support CAS
[ https://issues.apache.org/jira/browse/CASSANDRA-5062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13588567#comment-13588567 ]

Cristian Opris edited comment on CASSANDRA-5062 at 2/27/13 6:09 PM:

Note that a proposal may eventually succeed on recovery even if less than a quorum has managed to ack it before the leader fails (and the client has timed out).

The need for quorum writes is to be able to survive F failures out of 2F+1 replicas. Reads are not quorum reads, just replica-local reads.

Let's say we have 5 replicas with F1 as leader; F4 and F5 are ignored here as they don't matter.

{code}
1a F1 - proposal - F2
1b F1 - ack - F2
2a F1 - proposal - F3
2b F1 - ack - F3
3a F1 - OK - client
3b F1 - COMMIT - F2,F3
{code}

If F1 fails immediately after step 1b, F2 would become the leader since it has the latest seq number. Now only F2 has the proposal, but it can continue and commit it to the other followers. If it can't get a quorum (maybe it's partitioned in a minority) then it gives up leadership. When it rejoins the majority, it runs another recovery procedure that uses epoch numbers to determine whether it needs to throw away that proposal. This is fine, since no client has actually received confirmation that the proposal was committed.

This is detailed in the paper.
[jira] [Comment Edited] (CASSANDRA-5062) Support CAS
[ https://issues.apache.org/jira/browse/CASSANDRA-5062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13586371#comment-13586371 ]

Cristian Opris edited comment on CASSANDRA-5062 at 2/26/13 6:23 PM:

Afaict from the Spinnaker paper they only require ZK for fault-tolerant leader election, failure detection and possibly cluster membership (the lower right box in the diagram in 4.1). The rest of it is their actual data storage engine.

A few more comments:

1. Paxos can be made very efficient, particularly in stable operation scenarios. -I believe Zab devolves effectively into atomic broadcast (not even 2PC) with a stable leader. So you can normally do writes with a single roundtrip just like now.- Edit: Zab actually requires 4 delays (2 roundtrips).

2. There is a difference between what I described above and what Spinnaker does. I believe they elect a leader for the entire replica group, while my description assumes 1 full Paxos instance per row write. I'm not fully clear atm how this would work, but I believe even that can be optimized to a single roundtrip per write in normal operation (I believe it's in one of Google's papers that they piggyback the commit on the next proposal, for example).

Off the top of my head: the coordinator assumes one of the replicas is the most up-to-date and attempts to use it as leader. The replica starts a Paxos round, attaching the write payload. If it is accepted by a majority, the replica can send the commit, opportunistically attaching further proposals to it. If the Paxos round fails (or a number of rounds fail), it's likely the replica is behind on many rows, so the coordinator switches to another replica.

Now this is all preliminary as I haven't fully thought this through, but I think it's definitely worth investigating. While it may be a complicated protocol, it has significant performance advantages over locks. Just count how many roundtrips you'd need in the wait chain algorithm. Not to mention handling expired/orphan locks.
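The coordinator/replica flow sketched in the comment above can be illustrated with a toy single-decree Paxos round per row. This is only a rough sketch under stated assumptions, not anything from the Cassandra codebase: `Acceptor`, `paxos_round` and `coordinate_write` are hypothetical names, everything runs in memory with no networking, and a replica that is "behind" is modelled simply as one proposing with a stale ballot number.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Acceptor:
    """Per-row acceptor state (hypothetical, in-memory only)."""
    promised: int = -1                           # highest ballot promised so far
    accepted: Optional[Tuple[int, str]] = None   # (ballot, value) last accepted

    def prepare(self, ballot: int):
        # Promise only if the ballot is newer than anything seen before.
        if ballot <= self.promised:
            return False, None
        self.promised = ballot
        return True, self.accepted

    def accept(self, ballot: int, value: str) -> bool:
        # Reject proposals older than the current promise.
        if ballot < self.promised:
            return False
        self.promised = ballot
        self.accepted = (ballot, value)
        return True


def paxos_round(acceptors: List[Acceptor], ballot: int, value: str) -> Optional[str]:
    """Run one prepare/accept round; return the chosen value, or None on failure."""
    quorum = len(acceptors) // 2 + 1
    replies = [a.prepare(ballot) for a in acceptors]
    promises = [accepted for ok, accepted in replies if ok]
    if len(promises) < quorum:
        return None
    # If any acceptor already accepted a value, the proposer must adopt the
    # one with the highest ballot instead of its own payload.
    prior = [acc for acc in promises if acc is not None]
    if prior:
        value = max(prior, key=lambda acc: acc[0])[1]
    acks = sum(a.accept(ballot, value) for a in acceptors)
    return value if acks >= quorum else None


def coordinate_write(candidates, acceptors, value):
    """Try each (replica_name, ballot) in turn, switching replica when a round fails."""
    for name, ballot in candidates:
        chosen = paxos_round(acceptors, ballot, value)
        if chosen is not None:
            return name, chosen
    return None
```

With three acceptors of which two have already promised ballot 5, a replica proposing with ballot 3 fails and the coordinator falls through to a replica proposing with ballot 7. A later round carrying a different payload returns the already chosen value, which is how a competing CAS would learn that it lost.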
[jira] [Comment Edited] (CASSANDRA-5062) Support CAS
[ https://issues.apache.org/jira/browse/CASSANDRA-5062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13586394#comment-13586394 ]

Cristian Opris edited comment on CASSANDRA-5062 at 2/26/13 6:24 PM:

So I guess what I'm proposing is similar to what Piotr said above: each CAS is a round of Paxos. With some cleverness this can be collapsed to Multi-Paxos.

Spinnaker does leader election with ZK precisely because they did not want to implement Paxos themselves. From the paper, section 5: "The replication protocol has two phases: a leader election phase, followed by a quorum phase where the leader proposes a write and the followers accept it."

-That is Multi-Paxos, with the first phase (leader election) handled by ZK and the second phase being the steady state (propose/accept) with the actual write/commit.- Edit: it's not, see below.
[jira] [Commented] (CASSANDRA-5062) Support CAS
[ https://issues.apache.org/jira/browse/CASSANDRA-5062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13587399#comment-13587399 ]

Cristian Opris commented on CASSANDRA-5062:

Rereading the papers, Spinnaker, Zab and Sergio's option 3) are pretty much the same thing. The alternative is to do a true Paxos instance for each CAS round, but it is not clear how that can be done efficiently (and simply).
[jira] [Commented] (CASSANDRA-5062) Support CAS
[ https://issues.apache.org/jira/browse/CASSANDRA-5062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13586290#comment-13586290 ]

Cristian Opris commented on CASSANDRA-5062:

This shouldn't be too complicated with Paxos leader election, very similar to Spinnaker. I don't think it requires changing the read/write paths at the lower level, at least not significantly.

Assume for the sake of simplicity that we use a column prefix to encode the version. The elected leader should always be the one that has the latest version. This allows the leader to perform the read-modify-write (conditional update) locally and do a simple quorum write to propagate it if successful. The leader can also increment the version sequentially.

Conflicting writes from other replicas cannot succeed because any node that wants to write needs to get itself elected leader first.

Since we do quorum writes, not all replicas will have the full sequence of versions, but regular anti-entropy (read repair) on quorum reads should take care of that.

If the leader fails, the newly elected leader will necessarily be the one that has the latest write, so it can continue to do CAS locally. Anti-entropy should also take care of recovery and catch-up of a replica, just like now.

I believe this can all be done on top of existing functionality without major changes to the read/write paths.

You could also reuse the Zab algorithm from ZK for expediency without having to use the entire ZK codebase.
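As a rough illustration of the leader-local conditional update described in the comment above, here is a minimal in-memory sketch. It assumes a leader has already been elected and holds the latest version; `Replica`, `Leader` and `cas` are hypothetical names, not Cassandra APIs, and election, failover and anti-entropy are left out entirely.

```python
from typing import Dict, List, Optional, Tuple

Versioned = Tuple[int, str]  # (version, value); the version stands in for the column prefix


class Replica:
    """Hypothetical replica holding one versioned value per row key."""

    def __init__(self) -> None:
        self.rows: Dict[str, Versioned] = {}
        self.up = True

    def write(self, key: str, version: int, value: str) -> bool:
        # A down replica does not acknowledge; read repair would catch it up later.
        if not self.up:
            return False
        current = self.rows.get(key)
        if current is None or version > current[0]:
            self.rows[key] = (version, value)
        return True


class Leader(Replica):
    """The elected leader: assumed to always hold the latest version."""

    def __init__(self, followers: List[Replica]) -> None:
        super().__init__()
        self.followers = followers

    def cas(self, key: str, expected: Optional[str], new_value: str) -> bool:
        # Read and compare locally: no other node can write without being leader.
        current = self.rows.get(key)
        current_value = current[1] if current else None
        if current_value != expected:
            return False
        # Versions are incremented sequentially by the leader.
        version = (current[0] if current else 0) + 1
        # Simple quorum write; the leader counts as one acknowledgement.
        acks = 1 + sum(f.write(key, version, new_value) for f in self.followers)
        quorum = (len(self.followers) + 1) // 2 + 1
        if acks < quorum:
            # Simplification: followers that already applied the write are not rolled back.
            return False
        self.rows[key] = (version, new_value)
        return True
```

For the username example from the issue description, two clients racing on `cas("alice", None, ...)` cannot both succeed: the second call sees the value written by the first and returns `False`.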
[jira] [Comment Edited] (CASSANDRA-5062) Support CAS
[ https://issues.apache.org/jira/browse/CASSANDRA-5062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13586290#comment-13586290 ]

Cristian Opris edited comment on CASSANDRA-5062 at 2/25/13 9:14 PM:

This shouldn't be too complicated with Paxos leader election, very similar to Spinnaker. I don't think it requires changing the read/write paths at the lower level, at least not significantly.

Assume for the sake of simplicity that we use a column prefix to encode the version. The elected leader should always be the one that has the latest version. This allows the leader to perform the read-modify-write (conditional update) locally and do a simple quorum write to propagate it if successful. The leader can also increment the version sequentially.

Conflicting writes from other replicas cannot succeed because any node that wants to write needs to get itself elected leader first.

Since we do quorum writes, not all replicas will have the full sequence of versions, but regular anti-entropy (read repair) on quorum reads should take care of that.

If the leader fails, the newly elected leader will necessarily be the one that has the latest write, so it can continue to do CAS locally. Anti-entropy should also take care of recovery and catch-up of a replica, just like now.

I believe this can all be done on top of existing functionality without major changes to the read/write paths.

You could also reuse the Zab algorithm from ZK for expediency without having to use the entire ZK codebase.
[jira] [Comment Edited] (CASSANDRA-5062) Support CAS
[ https://issues.apache.org/jira/browse/CASSANDRA-5062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13586290#comment-13586290 ]

Cristian Opris edited comment on CASSANDRA-5062 at 2/25/13 9:16 PM:

This shouldn't be too complicated with Paxos leader election, very similar to Spinnaker. I don't think it requires changing the read/write paths at the lower level, at least not significantly.

Assume for the sake of simplicity that we use a column prefix to encode the version. The elected leader should always be the one that has the latest version. This allows the leader to perform the read-modify-write (conditional update) locally and do a simple quorum write to propagate it if successful. The leader can also increment the version sequentially.

Conflicting writes from other replicas cannot succeed because any node that wants to write needs to get itself elected leader first.

Since we do quorum writes, not all replicas will have the full sequence of versions, but regular anti-entropy (read repair) on quorum reads should take care of that.

If the leader fails, the newly elected leader will necessarily be the one that has the latest write, so it can continue to do CAS locally. Anti-entropy should also take care of recovery and catch-up of a replica, just like now.

I believe this can all be done on top of existing functionality without major changes to the read/write paths.

You could also reuse the Zab algorithm from ZK for expediency without depending on the entire ZK codebase.
[jira] [Commented] (CASSANDRA-5062) Support CAS
[ https://issues.apache.org/jira/browse/CASSANDRA-5062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13586371#comment-13586371 ]

Cristian Opris commented on CASSANDRA-5062:

Afaict from the Spinnaker paper they only require ZK for fault-tolerant leader election, failure detection and possibly cluster membership (the lower right box in the diagram in 4.1). The rest of it is their actual data storage engine.

A few more comments:

1. Paxos can be made very efficient, particularly in stable operation scenarios. I believe Zab devolves effectively into atomic broadcast (not even 2PC) with a stable leader. So you can normally do writes with a single roundtrip just like now.

2. There is a difference between what I described above and what Spinnaker does. I believe they elect a leader for the entire replica group, while my description assumes 1 full Paxos instance per row write. I'm not fully clear atm how this would work, but I believe even that can be optimized to a single roundtrip per write in normal operation (I believe it's in one of Google's papers that they piggyback the commit on the next proposal, for example).

Off the top of my head: the coordinator assumes one of the replicas is the most up-to-date and attempts to use it as leader. The replica starts a Paxos round, attaching the write payload. If it is accepted by a majority, the replica can send the commit, opportunistically attaching further proposals to it. If the Paxos round fails (or a number of rounds fail), it's likely the replica is behind on many rows, so the coordinator switches to another replica.

Now this is all preliminary as I haven't fully thought this through, but I think it's definitely worth investigating. While it may be a complicated protocol, it has significant performance advantages over locks. Just count how many roundtrips you'd need in the wait chain algorithm. Not to mention handling expired/orphan locks.
[jira] [Comment Edited] (CASSANDRA-5062) Support CAS
[ https://issues.apache.org/jira/browse/CASSANDRA-5062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13586371#comment-13586371 ]

Cristian Opris edited comment on CASSANDRA-5062 at 2/25/13 10:21 PM:

Afaict from the Spinnaker paper they only require ZK for fault-tolerant leader election, failure detection and possibly cluster membership (the lower right box in the diagram in 4.1). The rest of it is their actual data storage engine.

A few more comments:

1. Paxos can be made very efficient, particularly in stable operation scenarios. I believe Zab devolves effectively into atomic broadcast (not even 2PC) with a stable leader. So you can normally do writes with a single roundtrip just like now.

2. There is a difference between what I described above and what Spinnaker does. I believe they elect a leader for the entire replica group, while my description assumes 1 full Paxos instance per row write. I'm not fully clear atm how this would work, but I believe even that can be optimized to a single roundtrip per write in normal operation (I believe it's in one of Google's papers that they piggyback the commit on the next proposal, for example).

Off the top of my head: the coordinator assumes one of the replicas is the most up-to-date and attempts to use it as leader. The replica starts a Paxos round, attaching the write payload. If it is accepted by a majority, the replica can send the commit, opportunistically attaching further proposals to it. If the Paxos round fails (or a number of rounds fail), it's likely the replica is behind on many rows, so the coordinator switches to another replica.

Now this is all preliminary as I haven't fully thought this through, but I think it's definitely worth investigating. While it may be a complicated protocol, it has significant performance advantages over locks. Just count how many roundtrips you'd need in the wait chain algorithm. Not to mention handling expired/orphan locks.
[jira] [Commented] (CASSANDRA-5062) Support CAS
[ https://issues.apache.org/jira/browse/CASSANDRA-5062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13586376#comment-13586376 ]

Cristian Opris commented on CASSANDRA-5062:

See Multi-Paxos in the wikipedia article: http://en.wikipedia.org/wiki/Paxos_%28computer_science%29#Multi-Paxos
[jira] [Commented] (CASSANDRA-5062) Support CAS
[ https://issues.apache.org/jira/browse/CASSANDRA-5062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13586394#comment-13586394 ]

Cristian Opris commented on CASSANDRA-5062:

So I guess what I'm proposing is similar to what Piotr said above: each CAS is a round of Paxos. With some cleverness this can be collapsed to Multi-Paxos.

Spinnaker does leader election with ZK precisely because they did not want to implement Paxos themselves. From the paper, section 5: "The replication protocol has two phases: a leader election phase, followed by a quorum phase where the leader proposes a write and the followers accept it."

That is Multi-Paxos, with the first phase (leader election) handled by ZK and the second phase being the steady state (propose/accept) with the actual write/commit.
[jira] [Created] (CASSANDRA-4989) Expose new SliceQueryFilter features through Thrift interface
Cristian Opris created CASSANDRA-4989:

Summary: Expose new SliceQueryFilter features through Thrift interface
Key: CASSANDRA-4989
URL: https://issues.apache.org/jira/browse/CASSANDRA-4989
Project: Cassandra
Issue Type: Improvement
Components: API
Affects Versions: 1.2.0, 1.2.1, 1.3
Reporter: Cristian Opris

SliceQueryFilter has some very useful new features, like the ability to specify a composite column prefix to group by and a limit on the number of groups to return. This is very useful if, for example, I have a wide row with columns prefixed by timestamp and I want to retrieve the latest columns, but I don't know the column names.

Say I have a row
{{row - (t1, c1), (t1, c2)... (t1, cn) ... (t0,c1) ... etc}}

Query: slice range (t1,) group by prefix (1) limit (1)

As a more general question, is the Thrift interface going to be kept up-to-date with the feature changes or will it be left behind (a mistake IMO)?
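To make the requested query shape concrete, here is a small sketch of "slice from a time bound, group by the first composite component, return at most N groups" over an in-memory wide row. It is not the SliceQueryFilter or Thrift API: `latest_groups` and its parameters are hypothetical, columns are modelled as `((timestamp, name), value)` pairs sorted by composite name, and the first component is assumed to be the timestamp.

```python
from bisect import bisect_right
from itertools import groupby, islice


def latest_groups(columns, as_of, prefix_len=1, group_limit=1):
    """Return up to `group_limit` newest column groups with timestamp <= as_of.

    `columns` is a list of ((timestamp, name), value) pairs sorted ascending
    by composite name, mimicking a wide row.
    """
    timestamps = [name[0] for name, _ in columns]
    # Find the end of the slice without scanning: everything at or before as_of.
    end = bisect_right(timestamps, as_of)
    # Walk the slice newest-first and group on the composite prefix.
    newest_first = reversed(columns[:end])
    groups = groupby(newest_first, key=lambda col: col[0][:prefix_len])
    # islice stops after group_limit groups, so older columns are never touched.
    return [(prefix, list(cols)) for prefix, cols in islice(groups, group_limit)]
```

Calling `latest_groups(row, as_of=T)` with the defaults corresponds to the "slice range (t1,) group by prefix (1) limit (1)" query above when the exact `t1` is not known in advance: only the newest group at or before `T` is materialised.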
[jira] [Commented] (CASSANDRA-4989) Expose new SliceQueryFilter features through Thrift interface
[ https://issues.apache.org/jira/browse/CASSANDRA-4989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13503241#comment-13503241 ]

Cristian Opris commented on CASSANDRA-4989:

Sorry if I wasn't clear enough. What I'd like is to do that query efficiently when I don't know t1 precisely; I just want to get the latest columns before a time T.

That can be done currently with Thrift, but it will return all columns with time t < T, while this way I can efficiently get just the latest.

Note that "as of" type queries are very common in financial applications, for example, so it's worth considering this use case.

I'm not sure about the handling of deleted keys, but maybe we can find a way to generalize and expose this?

I would have asked for a feature like this anyway; it just so happens that, looking at the code, I see this has been done to support CQL limits. Since I have an object serialization client API on top of Thrift, CQL is not much use to me...
[jira] [Commented] (CASSANDRA-4977) Expose new SliceQueryFilter features through Thrift interface
[ https://issues.apache.org/jira/browse/CASSANDRA-4977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13501124#comment-13501124 ]

Cristian Opris commented on CASSANDRA-4977:

This was posted by me but apparently was logged with an anonymous account at the time

Expose new SliceQueryFilter features through Thrift interface

Key: CASSANDRA-4977
URL: https://issues.apache.org/jira/browse/CASSANDRA-4977
Project: Cassandra
Issue Type: Improvement
Components: API
Affects Versions: 1.2.0 beta 2
Reporter: aaa

SliceQueryFilter has some very useful new features like ability to specify a composite column prefix to group by and specify a limit of groups to return. This is very useful if for example I have a wide row with columns prefixed by timestamp and I want to retrieve the latest columns, but I don't know the column names.

Say I have a row
{{row - (t1, c1), (t1, c2)... (t1, cn) ... (t0,c1) ... etc}}

Query slice range (t1,) group by prefix (1) limit (1)

As a more general question, is the Thrift interface going to be kept up-to-date with the feature changes or will it be left behind (a mistake IMO) ?