[ https://issues.apache.org/jira/browse/CASSANDRA-2774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13050016#comment-13050016 ]

Yang Yang commented on CASSANDRA-2774:
--------------------------------------

---"Once compaction has compacted the deletes, all node will reach common 
agreement." 

We probably need to look at this more closely. I was going to use the example 
of a node whose reported value keeps changing because each read merges sstables 
in a different order, but since you pointed out that we should let all 
compaction finish before making the judgement, let's look at another ill-case:

The way I contrive the following case is very similar to the often-mentioned 
compaction ill-case, except that the effect is moved onto network messages: 
changing the order in which sstables are merged during compaction is very much 
like changing the order in which messages are delivered.

Let's say we have 4 nodes: A, B, C, D. All the traffic we observe is, in 
increasing timestamp order:

A leader:  add 1   ts=100
B leader:  delete  ts=200
C leader:  add 2   ts=300

Now the updates so far start to replicate to D. Assume that D sees them in the 
following order: A.(add 1), C.(add 2), B.(delete). After these, D's state is:
[A:1, C:2, last_delete=200, timestamp=300]

Now let all the traffic between A, B, C go through, and let them fully resolve 
(exchanging pair-wise messages, etc.), so A, B, C all come to the state:
[A:nil, C:2, last_delete=200, timestamp=300]

Now A's state and D's state differ. If we let A repair D, A's A-shard has a 
lower clock, so D wins; if we let D repair A, A's A-shard is isDelta(), so it 
trumps D. As a result, A and D never reach agreement, even though traffic is 
allowed to flow freely.
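
To make the non-convergence concrete, here is a toy simulation of the scenario. 
It is not the real CounterContext code: the Shard/Node classes and the clock 
values are made up, and it only encodes the merge rules as stated above (higher 
clock wins, except that a node's own delta shard trumps the incoming copy).

// Toy simulation of the scenario above -- NOT the real CounterContext code.
import java.util.HashMap;
import java.util.Map;

public class CounterDivergenceSketch {

    static class Shard {
        final long clock, count;   // count == 0 stands for "nil"
        final boolean delta;       // locally-owned shard, not yet cleaned
        Shard(long clock, long count, boolean delta) {
            this.clock = clock; this.count = count; this.delta = delta;
        }
    }

    static class Node {
        final char name;
        final Map<Character, Shard> shards = new HashMap<>();
        long lastDelete;
        Node(char name) { this.name = name; }

        // "remote repairs us": merge the remote context into this node
        void repairFrom(Node remote) {
            lastDelete = Math.max(lastDelete, remote.lastDelete);
            for (Map.Entry<Character, Shard> e : remote.shards.entrySet()) {
                Shard ours = shards.get(e.getKey());
                Shard theirs = e.getValue();
                if (ours == null) { shards.put(e.getKey(), theirs); continue; }
                if (ours.delta) continue;               // our delta shard trumps
                if (theirs.clock > ours.clock)          // otherwise higher clock wins
                    shards.put(e.getKey(), theirs);
            }
        }

        long total() {
            return shards.values().stream().mapToLong(s -> s.count).sum();
        }
    }

    public static void main(String[] args) {
        // A (and B, C) after full resolution: the delete wiped A's add,
        // leaving a nil A-shard that is still a delta; C:2 survived.
        Node a = new Node('A');
        a.shards.put('A', new Shard(0, 0, true));    // A:nil, delta, lower clock
        a.shards.put('C', new Shard(1, 2, false));   // C:2
        a.lastDelete = 200;

        // D merged add 1, add 2, delete, so its A-shard (value 1) survived.
        Node d = new Node('D');
        d.shards.put('A', new Shard(1, 1, false));   // A:1, higher clock
        d.shards.put('C', new Shard(1, 2, false));   // C:2
        d.lastDelete = 200;

        d.repairFrom(a);  // A repairs D: D's A-shard has the higher clock, D keeps A:1
        a.repairFrom(d);  // D repairs A: A's A-shard is a delta, so it trumps D

        System.out.println("A total = " + a.total());  // 2
        System.out.println("D total = " + d.total());  // 3 -> no convergence
    }
}

Under these rules, repairing in either direction, any number of times, leaves A 
answering 2 and D answering 3.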

I just started looking inside the CounterContext logic, so I could very well be 
wrong. Thanks for your time looking through this.



As for performance, I don't think the increase will be significant:

1) Most application use cases increment the same counter many times while it is 
in the memtable; it's hard to imagine that most counters are incremented only 
once before being dumped to an sstable. Only the first increment of each counter 
in the memtable incurs a disk read, so if on average each counter is incremented 
1000 times before being flushed, the cost of that read is amortized over 1000 
increments.

2) Even if we do the disk read, any realistic counter setup already needs ROW 
and CL > ONE, so a disk read happens anyway before the client is acked. Here we 
do an extra disk read, but when the regular disk read for 
CounterMutation.makeReplicationMutation() happens, the file blocks have already 
been brought into cache by the extra read, so that read is faster and the total 
disk access time remains roughly the same.
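
For a rough sense of point 1), a back-of-the-envelope calculation; the 10 ms 
read cost and the 1000 increments per flush are assumptions for illustration, 
not measurements:

// Back-of-the-envelope amortization of the extra read; both numbers below
// are assumptions for illustration, not measurements.
public class AmortizedReadCost {
    public static void main(String[] args) {
        double readMillis = 10.0;         // assumed cost of one random disk read
        long incrementsPerFlush = 1000;   // assumed increments per counter per flush
        System.out.printf("extra cost per increment: %.3f ms%n",
                          readMillis / incrementsPerFlush);  // prints 0.010 ms
    }
}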


> one way to make counter delete work better
> ------------------------------------------
>
>                 Key: CASSANDRA-2774
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2774
>             Project: Cassandra
>          Issue Type: New Feature
>    Affects Versions: 0.8.0
>            Reporter: Yang Yang
>         Attachments: counter_delete.diff
>
>
> current Counter does not work with delete, because different merging orders of 
> sstables produce different results, for example:
> add 1
> delete 
> add 2
> if the merging happens in the order 1--2, then (1,2)--3, the result we see 
> will be 2; if the merging is 1--3, then (1,3)--2, the result will be 3.
> the issue is that a delete currently cannot separate out the adds before it 
> from the adds after it. Conceptually, a delete should create a completely new 
> incarnation of the counter, a new "lifetime" or "epoch". The new approach 
> uses the concept of an "epoch number": each delete bumps up the epoch number. 
> Since each write is replicated (replicate on write is almost always enabled 
> in practice; if this is a concern, we could further force ROW in the case of 
> delete), the epoch number is global to a replica set.
> Changes are attached; existing tests pass fine, though some tests are 
> modified since the semantics change a bit. Some CQL tests do not pass in the 
> original 0.8.0 source; that's not the fault of this change.
> see details at 
> http://mail-archives.apache.org/mod_mbox/cassandra-user/201106.mbox/%3cbanlktikqcglsnwtt-9hvqpseoo7sf58...@mail.gmail.com%3E
> The goal of this is to make delete work (at least with consistent behavior; 
> yes, in the case of a long network partition the behavior is not ideal, but 
> it is consistent with the definition of a logical clock), so that we could 
> have expiring Counters.
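
To illustrate the epoch idea from the description in isolation, here is a 
minimal sketch. It is not the code in counter_delete.diff: the class and method 
names are made up, the same-epoch merge is simplified (real contexts compare 
per-shard clocks), and the add after the delete is assumed to already carry the 
bumped epoch, which is what replicate-on-write would propagate. A delete bumps 
the epoch, a merge lets the higher epoch win outright, so the add/delete/add 
example gives the same result in either merge order:

// Minimal sketch of the epoch idea -- not the code in counter_delete.diff.
import java.util.HashMap;
import java.util.Map;

public class EpochCounterSketch {

    long epoch;
    final Map<String, Long> shards = new HashMap<>();  // leader -> count

    void add(String leader, long amount) {
        shards.merge(leader, amount, Long::sum);
    }

    void delete() {
        epoch++;          // start a new "lifetime" of the counter
        shards.clear();
    }

    void mergeFrom(EpochCounterSketch other) {
        if (other.epoch > epoch) {            // older epoch is discarded wholesale
            epoch = other.epoch;
            shards.clear();
            shards.putAll(other.shards);
        } else if (other.epoch == epoch) {    // same epoch: ordinary shard merge
            other.shards.forEach((leader, count) ->
                shards.merge(leader, count, Math::max));
        }                                     // other.epoch < epoch: ignored
    }

    long total() {
        return shards.values().stream().mapToLong(Long::longValue).sum();
    }

    public static void main(String[] args) {
        // add 1, delete, add 2 -- the add after the delete carries the new epoch
        EpochCounterSketch add1 = new EpochCounterSketch();
        add1.add("A", 1);
        EpochCounterSketch del = new EpochCounterSketch();
        del.delete();                         // epoch 1, empty
        EpochCounterSketch add2 = new EpochCounterSketch();
        add2.epoch = 1;
        add2.add("C", 2);

        EpochCounterSketch order1 = new EpochCounterSketch();
        order1.mergeFrom(add1); order1.mergeFrom(del); order1.mergeFrom(add2);

        EpochCounterSketch order2 = new EpochCounterSketch();
        order2.mergeFrom(add1); order2.mergeFrom(add2); order2.mergeFrom(del);

        System.out.println(order1.total() + " == " + order2.total());  // 2 == 2
    }
}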


        
