[
https://issues.apache.org/jira/browse/CASSANDRA-4775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13623757#comment-13623757
]
Nicolas Favre-Felix commented on CASSANDRA-4775:
------------------------------------------------
I would like to describe a design that was discussed at Acunu last year, aiming
to resolve the problems pointed out by Sylvain as well as to remove the read
operation needed by replicate_on_write.
Our solution added a unique identifier per counter update operation, used to
identify duplicate commands and avoid overcounts on retry. The main problem
with storing (UUID, delta) pairs per counter is the O(N) read complexity; this
is how people implemented counters before 0.8, and it is a pretty inefficient
way of counting things.
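To make that read cost concrete, here is a minimal sketch of the pre-0.8 style
of idempotent counter (class and method names are illustrative, nothing here is
existing Cassandra code): duplicate update ids are ignored, so retries are
safe, but a read has to walk every delta ever written.
{code:java}
// Hypothetical sketch of the pre-0.8 approach: each update carries a
// unique id, so replaying a command is a no-op, but a read must visit
// every delta ever applied.
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

final class NaiveIdempotentCounter {
    private final Map<UUID, Long> deltas = new HashMap<>();

    /** Retry-safe: applying the same (id, delta) twice counts once. */
    void apply(UUID updateId, long delta) {
        deltas.putIfAbsent(updateId, delta);
    }

    /** O(N) in the number of updates ever applied to this counter. */
    long read() {
        long total = 0;
        for (long d : deltas.values()) {
            total += d;
        }
        return total;
    }
}
{code}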
Our idea was to merge those update pairs in the back-end, aiming to always keep
a small number of deltas instead of all of them. Merging those updates requires
some level of synchronisation between the replicas, but that's not something
that Cassandra is completely averse to, as active anti-entropy also requires
all replicas to be available.
This design considered using a tree per counter, with time-based buckets
containing all increments to the counter for a given time period - say, 5
seconds by default. Once this time has passed, the bucket for the past 5
seconds is queued for synchronisation amongst all replicas and eventually
replaced with an equivalent bucket containing a single increment with a UUID
built from all the updates that it replaces (using XOR would work). If the
replicas disagree on what needs to be in the bucket, they send each other
missed updates in the same way that data is exchanged during repair. If a node
is down, we keep accumulating 5-second buckets that will need to be merged
later.
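A hypothetical sketch of that collapse step (the class, its method names, and
the use of a plain map are my own illustration, not existing code): once the
replicas agree on a bucket's contents, its entries are replaced by a single
(UUID, delta) pair whose UUID is the XOR of all the originals and whose delta
is their sum.
{code:java}
// Hypothetical sketch of collapsing a closed time bucket: the
// individual (UUID, delta) entries are replaced by one entry whose
// UUID is the XOR of all the originals and whose delta is their sum.
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.UUID;

final class CounterBucket {
    private final Map<UUID, Long> updates = new LinkedHashMap<>();

    void add(UUID updateId, long delta) {
        updates.putIfAbsent(updateId, delta); // duplicate command ids are no-ops
    }

    /** Replace all entries with a single equivalent (UUID, delta) pair. */
    CounterBucket collapse() {
        long hi = 0, lo = 0, sum = 0;
        for (Map.Entry<UUID, Long> e : updates.entrySet()) {
            hi  ^= e.getKey().getMostSignificantBits();
            lo  ^= e.getKey().getLeastSignificantBits();
            sum += e.getValue();
        }
        CounterBucket merged = new CounterBucket();
        merged.add(new UUID(hi, lo), sum);
        return merged;
    }
}
{code}
XOR is a natural merge for the ids because it is commutative and associative:
replicas that agree on the set of updates in a bucket derive the same merged
UUID no matter what order they fold it in, and the same collapse applies again
when the buckets are merged further.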
The 5-second buckets are eventually merged into a minute bucket, then an hour
bucket, etc.
As an added bonus, the reduce function can be set to MIN, MAX, SUM_SQ, etc.
instead of just SUM.
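Hypothetically, that only requires the fold over a bucket's deltas to take a
pluggable map/combine pair instead of a hard-coded sum; the sketch below (all
names are mine) shows SUM, SUM_SQ, MIN and MAX falling out of the same
machinery.
{code:java}
// Hypothetical illustration: with a pluggable map step and combine
// function, the same reduction covers SUM, SUM_SQ, MIN and MAX.
import java.util.function.LongBinaryOperator;
import java.util.function.LongUnaryOperator;

final class BucketReduce {
    static long reduce(long[] deltas, LongUnaryOperator map,
                       LongBinaryOperator combine, long identity) {
        long acc = identity;
        for (long d : deltas) {
            acc = combine.applyAsLong(acc, map.applyAsLong(d));
        }
        return acc;
    }

    public static void main(String[] args) {
        long[] deltas = {3, -1, 4};
        LongUnaryOperator id = LongUnaryOperator.identity();
        System.out.println(reduce(deltas, id, Long::sum, 0));              // SUM    -> 6
        System.out.println(reduce(deltas, d -> d * d, Long::sum, 0));      // SUM_SQ -> 26
        System.out.println(reduce(deltas, id, Math::min, Long.MAX_VALUE)); // MIN    -> -1
        System.out.println(reduce(deltas, id, Math::max, Long.MIN_VALUE)); // MAX    -> 4
    }
}
{code}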
Here are the main drawbacks I see for this approach:
* The implementation becomes a fair bit more complicated.
* Counters take more space than they used to.
* The replicas all need to be up for the "collapsing" operation to run. It
might just be that counters start to get slower if some of your nodes are down
for a long time: you can't merge updates while a replica is down, or you might
lose increments.
* We introduce an actual timestamp instead of the current binary blob.
* The implementation is not compatible with the current one.
* The performance characteristics of these counters are unknown.
* No code exists.
> Counters 2.0
> ------------
>
> Key: CASSANDRA-4775
> URL: https://issues.apache.org/jira/browse/CASSANDRA-4775
> Project: Cassandra
> Issue Type: New Feature
> Components: Core
> Reporter: Arya Goudarzi
> Assignee: Aleksey Yeschenko
> Labels: counters
> Fix For: 2.0
>
>
> The existing partitioned counters remain a source of frustration for most
> users almost two years after being introduced. The remaining problems are
> inherent in the design, not something that can be fixed given enough
> time/eyeballs.
> Ideally a solution would give us
> - similar performance
> - fewer special cases in the code
> - potential for a retry mechanism