[ https://issues.apache.org/jira/browse/CASSANDRA-2774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13049874#comment-13049874 ]
Yang Yang commented on CASSANDRA-2774:
--------------------------------------

You are right: "it just cannot return 1 (at) *that time*". 0 or 2 is an unstable value that the system had at some past snapshot in time, but it will eventually come to answer 1.

Since our edge case above assumes that B has not got the deletion yet, the leader in the second increment cannot be A, because otherwise B must have got the deletion from A, since on A the increment comes later. So B was the leader in the second increment.

For C, it now has the new epoch. Let's say A's second increment reaches C (through repair, since A is not the leader in the second increment): this increment has the new epoch, so it will be accepted by C. If B's second increment reaches C, it belongs to the old epoch, so it will be rejected.

For B, it is still on the old epoch; after the second increment, B's count is 2 in the old epoch. But when A's increment goes to B through repair, or is reconciled with B in a read, the result is going to be 1. If C's deletion goes to B, B is going to be brought more up to date, to a value of 0 in the new epoch.

The above analysis does not go through all possible scenarios; to give a definitive proof of the conjecture that "all nodes return *the* ordering given by the client, in case of quorum read/write", I need to think more. But as I stated in my last comment, at least we can be sure that the new approach guarantees *some* common agreement eventually.
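As a rough illustration of the reconciliation described above, here is a minimal sketch. It is not the code in counter_delete.diff: the `Replica` type, its `epoch` and `count` fields, and the `reconcile` function are hypothetical names, and the same-epoch rule (keep the larger count) is a simplifying assumption standing in for the real per-shard counter merge. The three states mirror the edge case: B still holds 2 in the old epoch, A holds 1 in the new epoch, and C holds 0 in the new epoch after the delete.

```python
from dataclasses import dataclass
from functools import reduce
from itertools import permutations


@dataclass(frozen=True)
class Replica:
    epoch: int  # hypothetical field: bumped once per delete
    count: int  # counter value accumulated within that epoch


def reconcile(left: Replica, right: Replica) -> Replica:
    """Merge two replica states; the state with the newer epoch wins outright."""
    if left.epoch != right.epoch:
        # A state from an older epoch is rejected, whatever its count is.
        return left if left.epoch > right.epoch else right
    # Assumption for this sketch only: within one epoch, keep the larger count.
    return left if left.count >= right.count else right


a = Replica(epoch=1, count=1)  # saw the delete, then the second increment
b = Replica(epoch=0, count=2)  # missed the delete, led the second increment
c = Replica(epoch=1, count=0)  # saw the delete only

# Merge the three states in every possible order and collect the outcomes.
results = {reduce(reconcile, order) for order in permutations([a, b, c])}
print(results)  # {Replica(epoch=1, count=1)}
```

Under these assumptions every merge order ends at a count of 1 in the new epoch, and merging only B with C gives 0 in the new epoch, which matches the intermediate value mentioned above.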
It would be nice if we achieved *the* agreement in case of quorum, but that's not my main argument.

> one way to make counter delete work better
> ------------------------------------------
>
> Key: CASSANDRA-2774
> URL: https://issues.apache.org/jira/browse/CASSANDRA-2774
> Project: Cassandra
> Issue Type: New Feature
> Affects Versions: 0.8.0
> Reporter: Yang Yang
> Attachments: counter_delete.diff
>
>
> The current Counter does not work with delete, because different merging orders of
> sstables would produce different results, for example:
> add 1
> delete
> add 2
> if the merging happens in 1-2, (1,2)--3 order, the result we see will be 2
> if the merging is 1--3, (1,3)--2, the result will be 3.
> The issue is that delete currently cannot separate the adds made before the
> delete from the adds made after it. Supposedly a delete is to create a completely new
> incarnation of the counter, or a new "lifetime", or "epoch". The new approach
> utilizes the concept of an "epoch number", so that each delete bumps up the
> epoch number. Since each write is replicated (replicate on write is almost
> always enabled in practice; if this is a concern, we could further force ROW
> in case of delete), the epoch number is global to a replica set.
> Changes are attached. Existing tests pass fine; some tests are modified since
> the semantics changed a bit. Some cql tests do not pass in the original
> 0.8.0 source; that's not the fault of this change.
> See details at
> http://mail-archives.apache.org/mod_mbox/cassandra-user/201106.mbox/%3cbanlktikqcglsnwtt-9hvqpseoo7sf58...@mail.gmail.com%3E
> The goal of this is to make delete work (at least with consistent behavior;
> yes, in case of a long network partition the behavior is not ideal, but it's
> consistent with the definition of a logical clock), so that we could have
> expiring Counters.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira