[ https://issues.apache.org/jira/browse/CASSANDRA-4775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13623757#comment-13623757 ]

Nicolas Favre-Felix commented on CASSANDRA-4775:
------------------------------------------------

I would like to describe a design that was discussed at Acunu last year, aiming 
to resolve the problems pointed out by Sylvain as well as remove the read 
operation needed by replicate_on_write.

Our solution added a unique identifier per counter update operation, used to 
identify duplicate commands and avoid overcounts on retry. The main problem 
with storing (UUID, delta) pairs per counter is the O(N) read complexity; this 
is how people implemented counters before 0.8, and it is a pretty inefficient 
way of counting things.
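
To make that concrete, here is a rough sketch in Java of what an idempotent 
counter built on (UUID, delta) pairs looks like and why reads are O(N). This is 
illustrative only, not Cassandra code; the class and method names are invented.

import java.util.LinkedHashMap;
import java.util.Map;
import java.util.UUID;

// Illustrative sketch only (not Cassandra code): a counter stored as one
// (UUID, delta) entry per update. Retries are idempotent because a duplicate
// UUID is ignored, but reading the value is O(N) in the number of updates
// ever applied.
public class NaiveIdempotentCounter
{
    private final Map<UUID, Long> deltas = new LinkedHashMap<>();

    // Applying the same (updateId, delta) pair twice has no effect.
    public void apply(UUID updateId, long delta)
    {
        deltas.putIfAbsent(updateId, delta);
    }

    // O(N): every delta ever written has to be read back and summed.
    public long value()
    {
        long sum = 0;
        for (long d : deltas.values())
            sum += d;
        return sum;
    }

    public static void main(String[] args)
    {
        NaiveIdempotentCounter c = new NaiveIdempotentCounter();
        UUID id = UUID.randomUUID();
        c.apply(id, 5);
        c.apply(id, 5); // retried command with the same UUID: no overcount
        System.out.println(c.value()); // prints 5, not 10
    }
}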

Our idea was to merge those update pairs in the back-end, trying to keep a 
small number of deltas at any time instead of all of them. Merging those 
updates requires some level of synchronisation between the replicas, but 
that's not something Cassandra is completely averse to, as active anti-entropy 
also requires all replicas to be available.
This design considered using a tree per counter, with time-based buckets 
containing all increments to the counter for a given time period - say, 5 
seconds by default. Once this time has passed, the bucket for the past 5 
seconds is queued for synchronisation amongst all replicas and eventually 
replaced with an equivalent bucket containing a single increment with a UUID 
built from all the updates that it replaces (using XOR would work).

If the replicas disagree on what needs to be in the bucket, they send each 
other missed updates in the same way that data is exchanged during repair. If 
a node is down, we keep accumulating 5-second buckets that will need to be 
merged later.

The 5-second buckets are eventually merged into a minute bucket, then an hour 
bucket, etc.
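
To illustrate just the collapsing step (again a made-up sketch, not a proposed 
implementation, and it ignores the replica synchronisation entirely), a closed 
bucket could be replaced by a single equivalent increment like this:

import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

// Illustrative sketch only, names invented: N (UUID, delta) updates in a
// closed time bucket are replaced by one equivalent increment whose UUID is
// the XOR of the UUIDs it replaces.
public class CounterBucket
{
    public static final class Update
    {
        final UUID id;
        final long delta;

        Update(UUID id, long delta)
        {
            this.id = id;
            this.delta = delta;
        }
    }

    private final List<Update> updates = new ArrayList<>();

    public void add(UUID id, long delta)
    {
        updates.add(new Update(id, delta));
    }

    // Replace every update in the bucket with one equivalent update.
    public Update collapse()
    {
        long msb = 0, lsb = 0, sum = 0;
        for (Update u : updates)
        {
            msb ^= u.id.getMostSignificantBits();
            lsb ^= u.id.getLeastSignificantBits();
            sum += u.delta;
        }
        return new Update(new UUID(msb, lsb), sum);
    }
}

Because XOR is commutative and associative, replicas that agree on a bucket's 
contents derive the same merged UUID regardless of the order in which they saw 
the updates.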

As an added bonus, the reduce function can be set to MIN, MAX, SUM_SQ, etc. 
instead of just SUM.
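
As a made-up illustration of that point (names invented, not a proposed API), 
the collapse could take any associative, commutative reducer instead of being 
hard-wired to SUM:

import java.util.Arrays;
import java.util.List;
import java.util.function.LongBinaryOperator;

// Illustrative sketch only: a pluggable reduce function for the collapse.
// SUM_SQ would square each delta before reducing with +.
public class ReducingCollapse
{
    public static long collapse(List<Long> deltas, long identity, LongBinaryOperator reducer)
    {
        long acc = identity;
        for (long d : deltas)
            acc = reducer.applyAsLong(acc, d);
        return acc;
    }

    public static void main(String[] args)
    {
        List<Long> deltas = Arrays.asList(3L, -1L, 7L);
        System.out.println(collapse(deltas, 0L, Long::sum));             // SUM -> 9
        System.out.println(collapse(deltas, Long.MAX_VALUE, Math::min)); // MIN -> -1
        System.out.println(collapse(deltas, Long.MIN_VALUE, Math::max)); // MAX -> 7
    }
}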

Here are the main drawbacks I see for this approach:

* The implementation becomes a fair bit more complicated.
* Counters take more space than they used to.
* The replicas all need to be up for the "collapsing" operation to run. It 
might just be that counters get slower if some of your nodes are down for a 
long time: you can't merge updates while a replica is down, or you might lose 
increments.
* We introduce an actual timestamp instead of the current binary blob.
* The implementation is not compatible with the current one.
* The performance characteristics of these counters are unknown.
* No code exists.

                
> Counters 2.0
> ------------
>
>                 Key: CASSANDRA-4775
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4775
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Arya Goudarzi
>            Assignee: Aleksey Yeschenko
>              Labels: counters
>             Fix For: 2.0
>
>
> The existing partitioned counters remain a source of frustration for most 
> users almost two years after being introduced.  The remaining problems are 
> inherent in the design, not something that can be fixed given enough 
> time/eyeballs.
> Ideally a solution would give us
> - similar performance
> - less special cases in the code
> - potential for a retry mechanism
