[ https://issues.apache.org/jira/browse/CASSANDRA-4775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13623757#comment-13623757 ]
Nicolas Favre-Felix commented on CASSANDRA-4775:
------------------------------------------------

I would like to describe a design that was discussed at Acunu last year, aiming to resolve the problems pointed out by Sylvain as well as remove the read operation needed by replicate_on_write.

Our solution added a unique identifier per counter update operation, used to identify duplicate commands and avoid overcounts on retry. The main problem with storing (UUID, delta) pairs per counter is the O(N) read complexity; this is how people implemented counters before 0.8, and it is a pretty inefficient way of counting things. Our idea was to merge those update pairs in the back-end, trying to always keep a small number of deltas instead of all of them. Merging those updates requires some level of synchronisation between the replicas, but that's not something Cassandra is completely averse to, as active anti-entropy also requires all replicas to be available.

This design considered using a tree per counter, with time-based buckets containing all increments to the counter for a given time period - say, 5 seconds by default. Once this time has passed, the bucket for the past 5 seconds is queued for synchronisation amongst all replicas and eventually replaced with an equivalent bucket containing a single increment, with a UUID built from all the updates that it replaces (using XOR would work). If the replicas disagree on what needs to be in the bucket, they send each other the missed updates, in the same way that data is exchanged during repair. If a node is down, we keep accumulating 5-second buckets that will need to be merged later. The 5-second buckets are eventually merged into a minute bucket, then an hour bucket, etc. As an added bonus, the reduce function can be set to MIN, MAX, SUM_SQ, etc. instead of just SUM.

Here are the main drawbacks I see for this approach:
* The implementation becomes a fair bit more complicated.
* Counters take more space than they used to.
* The replicas all need to be up for the "collapsing" operation to run. It might just be that counters start to get slower if some of your nodes are down for a long time. You can't merge updates with a replica down, or you might lose increments.
* We introduce an actual timestamp instead of the current binary blob.
* The implementation is not compatible with the current one.
* The performance characteristics of these counters are unknown.
* No code exists.

> Counters 2.0
> ------------
>
>                 Key: CASSANDRA-4775
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4775
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Arya Goudarzi
>            Assignee: Aleksey Yeschenko
>              Labels: counters
>             Fix For: 2.0
>
>
> The existing partitioned counters remain a source of frustration for most
> users almost two years after being introduced. The remaining problems are
> inherent in the design, not something that can be fixed given enough
> time/eyeballs.
> Ideally a solution would give us
> - similar performance
> - less special cases in the code
> - potential for a retry mechanism

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
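The bucket-collapse mechanism described in the comment above can be sketched as a small toy in Python. This is not an implementation of the proposal (the comment notes that no code exists); it is a minimal illustration, with a hypothetical `Bucket` class, of the two key properties: retried updates carrying the same UUID are deduplicated rather than double-counted, and a set of (UUID, delta) pairs collapses to a single pair whose id is the XOR of the originals, so replicas that agree on the update set produce an identical merged pair.

```python
import uuid
from functools import reduce

class Bucket:
    """One time-window of counter updates on a single replica (toy sketch)."""

    def __init__(self):
        self.updates = {}  # update id (int) -> delta

    def apply(self, update_id, delta):
        # A retried command reuses its original id, so a duplicate
        # delivery is detected here and does not overcount.
        if update_id not in self.updates:
            self.updates[update_id] = delta

    def value(self, reduce_fn=sum):
        # The reduce function could also be min, max, sum of squares, etc.
        return reduce_fn(self.updates.values())

    def collapse(self, reduce_fn=sum):
        # Replace all (id, delta) pairs with a single pair whose id is
        # the XOR of the constituent ids. Any replica that collapsed the
        # same set of updates arrives at the identical merged pair.
        merged_id = reduce(lambda a, b: a ^ b, self.updates.keys())
        merged_value = reduce_fn(self.updates.values())
        self.updates = {merged_id: merged_value}
        return merged_id, merged_value

# Two distinct updates, one of which is delivered twice (a client retry).
b = Bucket()
u1, u2 = uuid.uuid4().int, uuid.uuid4().int
b.apply(u1, 5)
b.apply(u2, 3)
b.apply(u1, 5)            # retry of the first update: ignored
assert b.value() == 8     # no overcount
b.collapse()
assert b.value() == 8     # value is preserved after collapsing
```

In the real design the collapse would only run after the replicas have synchronised the bucket's contents; this sketch shows only the single-replica bookkeeping.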