[ https://issues.apache.org/jira/browse/CASSANDRA-2774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13049688#comment-13049688 ]
Sylvain Lebresne commented on CASSANDRA-2774:
---------------------------------------------

Consider 3 nodes A, B and C with RF=2 and a given counter c whose replica set is {B, C}. Consider a single client issuing the following operations (in order) while connected to node A:
# client increments c by +2 at CL.ONE
# client deletes c at CL.ONE
# client increments c by +3 at CL.ONE
# client reads c at CL.ALL

The *only* valid answer the client should ever get on its last read is 3. Any other value breaks the consistency level contract and is not something we can expect people to be happy with. Any other answer means that deletes are broken (and this *is* the problem with the current implementation).

However, because the writes are made at CL.ONE in the example above, at the time the read is issued, the only thing we know for sure is that each write has been received by one node, but not necessarily the same one each time. Depending on the actual timing and on which node happens to be the one acknowledging each write, when the read reaches the nodes you can have a lot of different situations, including:
* B and C have both received the 3 writes in the right order; they will both return 3, the 'right' answer.
* B received the deletion (the two increments are still on the wire, yet to be received) and C received the two increments (the delete is still on the wire, yet to be received). B will return the tombstone, C will return 5. You can assign all the epoch numbers you want, there is no way you can return 3 to the client. It will be either 5 or 0.

So the same query will result in different answers depending on the internal timing of events, and will sometimes return an answer that breaks the contract. Removes of counters are broken and the only safe way to use them is for permanent removal with no following inserts. This patch doesn't fix it.
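To make the scenario above concrete, here is a minimal sketch of what the two replicas could report at read time. It is a toy model and not Cassandra code: `replica_state` and `possible_reads` are hypothetical helpers, a tombstone is modelled as `None`, and reconciliation is assumed to pick one replica's answer (with a tombstone read back as 0).

```python
# Hypothetical model, not Cassandra code: a replica applies the writes it has
# received, in order. None stands for "tombstone only, no live value".
INCR_2, DELETE, INCR_3 = ("incr", 2), ("delete", 0), ("incr", 3)

def replica_state(received):
    """Return the value a replica holds after applying `received` in order."""
    value = None
    for kind, amount in received:
        if kind == "delete":
            value = None                    # delete wipes everything seen so far
        else:
            value = (value or 0) + amount   # increment on top of what is there
    return value

def possible_reads(states):
    """Assumption: the read can only surface one replica's answer,
    with a tombstone read back as 0."""
    return {0 if s is None else s for s in states}

# Case 1: both replicas got all three writes in order.
in_sync = [replica_state([INCR_2, DELETE, INCR_3])] * 2
# Case 2: one replica only has the delete, the other only the two increments.
diverged = [replica_state([DELETE]), replica_state([INCR_2, INCR_3])]

print(sorted(possible_reads(in_sync)))
print(sorted(possible_reads(diverged)))
```

In this model the in-sync case can only yield 3, while the diverged case can only yield 0 or 5. Neither replica holds enough information to tell which part of the 5 came before the delete, which is the point of the argument above.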
Btw, it's not too hard to come up with the same kind of example using only QUORUM reads and writes (but you'll need one more replica and a few more steps).

> one way to make counter delete work better
> ------------------------------------------
>
>                 Key: CASSANDRA-2774
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2774
>             Project: Cassandra
>          Issue Type: New Feature
>    Affects Versions: 0.8.0
>            Reporter: Yang Yang
>         Attachments: counter_delete.diff
>
>
> The current Counter does not work with delete, because different merging
> orders of sstables would produce different results, for example:
> add 1
> delete
> add 2
> if the merging happens in 1-2, (1,2)--3 order, the result we see will be 2
> if the merging is: 1--3, (1,3)--2, the result will be 3.
> The issue is that a delete currently cannot separate the adds made before it
> from the adds made after it. Supposedly a delete is to create a completely
> new incarnation of the counter, or a new "lifetime", or "epoch". The new
> approach utilizes the concept of an "epoch number", so that each delete bumps
> up the epoch number. Since each write is replicated (replicate on write is
> almost always enabled in practice; if this is a concern, we could further
> force ROW in case of delete), the epoch number is global to a replica set.
> Changes are attached. Existing tests pass fine; some tests are modified since
> the semantics have changed a bit. Some cql tests do not pass in the original
> 0.8.0 source; that's not the fault of this change.
> See details at
> http://mail-archives.apache.org/mod_mbox/cassandra-user/201106.mbox/%3cbanlktikqcglsnwtt-9hvqpseoo7sf58...@mail.gmail.com%3E
> The goal of this is to make delete work (at least with consistent behavior;
> yes, in case of a long network partition the behavior is not ideal, but it's
> consistent with the definition of a logical clock), so that we could have
> expiring Counters.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira