[ https://issues.apache.org/jira/browse/CASSANDRA-2774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13050016#comment-13050016 ]
Yang Yang commented on CASSANDRA-2774:
--------------------------------------

"Once compaction has compacted the deletes, all nodes will reach common agreement." We probably need to look at this more closely. I was going to use the example of a node that keeps changing its reported value because of the different merging order on each read, but since you pointed out that we should allow all compaction to finish before making the judgement, let's look at another ill-behaved case. The case I contrive below is very similar to the often-mentioned compaction ill-case, moved onto network messages: changing the order in which sstables are merged during compaction is very similar to changing the order of message deliveries.

Say we have 4 nodes: A, B, C, D. All the traffic we observe, in increasing timestamp() order, is:

A leader: add 1, ts=100
B leader: delete, ts=200
C leader: add 2, ts=300

Now these updates start to replicate to D. Assume that D sees them in the order A.(add 1), C.(add 2), B.(delete). After these, D's state is:

[A:1 C:2, last_delete=200, timestamp=300]

Now let all the traffic between A, B, and C go through, and let them fully resolve (receiving pair-wise messages, etc.), so A, B, and C all arrive at the state:

[A:nil C:2, last_delete=200, timestamp=300]

A's state and D's state now differ. If we let A repair D, A's A-shard has a lower clock, so D wins; if we let D repair A, A's A-shard is isDelta(), so it trumps D. As a result, A and D never reach agreement, even though traffic is allowed to flow freely. I just started looking inside the CounterContext logic, so I could very well be wrong. Thanks for your time looking through this.

As to performance, I don't think this will be a significant increase:

1) Most application use cases increment the same counter many times while it is in the memtable; it's hard to imagine that most counters will be incremented only once before being dumped out to an sstable.
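(As an aside, the four-node ill-case above can be reproduced with a toy model. This is a deliberate simplification of my own, not Cassandra's actual CounterContext: here a delete cancels the counts merged so far only when its timestamp is newer than everything already merged, so the final value depends on delivery order.)

```python
# Toy model of order-dependent counter merging (a hypothetical simplification,
# not Cassandra's CounterContext): a delete wipes the running total only if it
# is newer than everything merged so far.
def apply(state, op):
    """state = (total, latest_ts); op = ('add', n, ts) or ('delete', ts)."""
    total, ts = state
    if op[0] == 'add':
        _, n, op_ts = op
        return (total + n, max(ts, op_ts))
    # delete: cancels the total only when newer than all merged updates
    _, op_ts = op
    if op_ts > ts:
        return (0, op_ts)
    return (total, ts)

ops = {
    'A': ('add', 1, 100),   # A leader: add 1,  ts=100
    'B': ('delete', 200),   # B leader: delete, ts=200
    'C': ('add', 2, 300),   # C leader: add 2,  ts=300
}

# A, B, C exchange messages in timestamp order:
abc = (0, 0)
for leader in ('A', 'B', 'C'):
    abc = apply(abc, ops[leader])

# D happens to receive the two adds before the delete:
d = (0, 0)
for leader in ('A', 'C', 'B'):
    d = apply(d, ops[leader])

print(abc[0])  # 2 -> like [A:nil, C:2]
print(d[0])    # 3 -> like [A:1, C:2]: the delete arrived too late to cancel add 1
```

Same three updates, two delivery orders, two permanently different totals.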
Only the first increment in the memtable for each counter suffers a read from disk; if on average each counter is incremented 1000 times before being flushed, the disk-read cost is amortized over 1000 increments.

2) Even if we do the disk read, any realistic counter setup already needs ROW and CL > ONE, so a disk read is needed anyway before the client is acked. Here we do an extra disk read, but when we do the regular disk read for CounterMutation.makeReplicationMutation(), the file blocks have already been brought into cache by the new extra read, so it saves time and the total disk-access time remains roughly the same.

> one way to make counter delete work better
> ------------------------------------------
>
>                 Key: CASSANDRA-2774
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2774
>             Project: Cassandra
>          Issue Type: New Feature
>    Affects Versions: 0.8.0
>            Reporter: Yang Yang
>         Attachments: counter_delete.diff
>
>
> The current Counter does not work with delete, because different merging orders of
> sstables produce different results. For example:
> add 1
> delete
> add 2
> If the merging happens in the order 1--2, then (1,2)--3, the result we see will be 2;
> if the merging is 1--3, then (1,3)--2, the result will be 3.
> The issue is that a delete cannot separate the adds before it from the adds
> after it. Supposedly a delete creates a completely new incarnation of the
> counter, a new "lifetime" or "epoch". The new approach uses the concept of an
> "epoch number": each delete bumps up the epoch number. Since each write is
> replicated (replicate-on-write is almost always enabled in practice; if this is
> a concern, we could further force ROW in the case of a delete), the epoch
> number is global to a replica set.
> Changes are attached; existing tests pass fine, and some tests are modified
> since the semantics changed a bit. Some cql tests do not pass in the original
> 0.8.0 source; that is not the fault of this change.
> See details at
> http://mail-archives.apache.org/mod_mbox/cassandra-user/201106.mbox/%3cbanlktikqcglsnwtt-9hvqpseoo7sf58...@mail.gmail.com%3E
> The goal of this is to make delete work (at least with consistent behavior;
> yes, in the case of a long network partition the behavior is not ideal, but it
> is consistent with the definition of a logical clock), so that we could have
> expiring Counters.
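The epoch idea from the description can be illustrated with a small toy model (an illustration of the idea only, not the attached counter_delete.diff): every increment is tagged with the counter's current epoch, a delete bumps the epoch with zero counts, and a merge keeps only the counts of the highest epoch. Merging then commutes, so every merge order agrees.

```python
# Toy sketch of an "epoch number" counter scheme (hypothetical, not the
# attached patch): each fragment is (epoch, total); a delete appears as the
# next epoch with total 0. Merging keeps the higher epoch, or sums equal
# epochs, so the result is independent of merge order.
from itertools import permutations

def merge(x, y):
    """x, y = (epoch, total). Higher epoch wins; equal epochs sum."""
    if x[0] != y[0]:
        return max(x, y, key=lambda s: s[0])
    return (x[0], x[1] + y[1])

# add 1 (epoch 0), delete (bumps to epoch 1, zero counts), add 2 (epoch 1)
fragments = [(0, 1), (1, 0), (1, 2)]

results = set()
for order in permutations(fragments):
    acc = (0, 0)
    for frag in order:
        acc = merge(acc, frag)
    results.add(acc)

print(results)  # {(1, 2)} -- every merge order agrees on 2
```

Unlike the add 1 / delete / add 2 example above, which yields 2 or 3 depending on merge order, all six orders here converge on the same value.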