[
https://issues.apache.org/jira/browse/CASSANDRA-4775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13613693#comment-13613693
]
Sylvain Lebresne commented on CASSANDRA-4775:
---------------------------------------------
For the record, I'd like to "quickly" sum up the problems of counters 1.0, and
more precisely which of them I think are inherent to the design, and which I
believe might be fixable with some effort.
The current counter implementation is based on the idea of internally keeping
one separate sub-counter (or "shard") for each replica of the counter, and
making sure that for each increment, one and only one shard is ever
incremented. The latter is ensured by the special write path of counters, which:
* picks a live replica and forwards it the increment
* has that replica increment its own "shard" locally
* then has the replica send the *result* of this local shard increment to the
other replicas
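The write path above can be sketched as follows. This is a toy model, not
Cassandra code: the `Replica` class, its shard representation as
(clock, count) pairs, and the `coordinate` helper are all invented for
illustration.

```python
class Replica:
    def __init__(self, name, peers):
        self.name = name
        self.peers = peers  # shared list of all replicas, including self
        self.shards = {}    # shard owner -> (clock, count)

    def local_increment(self, delta):
        # Step 2: the chosen replica increments only its OWN shard.
        clock, count = self.shards.get(self.name, (0, 0))
        self.shards[self.name] = (clock + 1, count + delta)
        # Step 3: it then sends the *result* of that local increment
        # (not the delta) to the other replicas.
        for peer in self.peers:
            if peer is not self:
                peer.receive(self.name, self.shards[self.name])

    def receive(self, owner, shard):
        # A remote shard replaces the stored one only if its clock is bigger.
        if shard[0] > self.shards.get(owner, (0, 0))[0]:
            self.shards[owner] = shard

    def total(self):
        # The counter's value is the sum over all shards.
        return sum(count for _, count in self.shards.values())

def coordinate(delta, replicas):
    # Step 1: the coordinator picks a live replica and forwards the increment.
    replicas[0].local_increment(delta)  # pretend replicas[0] is alive
```

Because each write touches exactly one shard and remote shards are replaced
wholesale by clock, replicas converge on the same total regardless of which
replica led each increment.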
This mechanism has (at least) the following problems:
# counters cannot be retried safely on timeout.
# removing counters works only halfway. If you re-increment a deleted counter
too soon, the result is somewhat random.
Those problems are largely due to the general mechanism used, not to
implementation details. That being said, on the retry problem, I'll note that
while I don't think we can fix it in the current mechanism, tickets like
CASSANDRA-3199 could mitigate it somewhat by making TimeoutException less
likely.
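A minimal sketch of why retries are unsafe under this scheme (the names are
hypothetical; the point is only that a shard increment applies a
non-idempotent delta):

```python
# The replica's shard stores a running delta, so applying the same
# increment twice over-counts. A client that times out cannot tell
# "the write was lost" from "the write landed but the ack was lost",
# so retrying risks exactly this double application.

shard = {"clock": 0, "count": 0}

def apply_increment(delta):
    shard["clock"] += 1
    shard["count"] += delta

apply_increment(10)  # first attempt: applied, but suppose the ack times out
apply_increment(10)  # the client retries the "failed" write
# The intended total was 10, but the shard now holds 20.
```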
Other problems are due more to how the implementation works. More precisely,
they are due to how a replica proceeds to increment its own shard. To do
that, the implementation uses separate merge rules for "local" shards and
"remote" ones. Namely, local shards are summed during merge (so the sub-count
they contain is used as a delta), while for remote ones the "biggest" value is
kept (where "biggest" means "the one with the biggest clock"). So for remote
shards, conflicts are handled as "latest wins", as usual. The reason for that
difference between local and remote shards is performance: when a replica
needs to increment its shard, it needs to do so "atomically". So if local
shards were handled like remote ones, then to increment the local shard we would
need to 1) grab a lock, 2) read the current value, 3) increment it, 4) write it
and then 5) release the lock. Instead, in the current implementation, the
replica just inserts an increment to its own shard, and to find the total value
of its local shard, the increments simply get merged on reads. In
practice, what we win is that we don't have to grab a lock.
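The two merge rules can be sketched like this, with shards shown as
(clock, count) pairs. The representation is simplified for illustration and
the function names are invented:

```python
def merge_local(a, b):
    # Local shard: the stored counts are deltas, so merging sums them
    # (clocks are summed too, so the merged shard dominates both inputs).
    return (a[0] + b[0], a[1] + b[1])

def merge_remote(a, b):
    # Remote shard: "latest wins", where latest means the biggest clock.
    return a if a[0] >= b[0] else b

# Merging the same local fragment twice over-counts, while merging the
# same remote shard twice is harmless: merge_local is not idempotent,
# merge_remote is.
```

This asymmetry is exactly what makes the two rules behave so differently
under duplication: `merge_remote((1, 5), (1, 5))` still yields `(1, 5)`,
but `merge_local((1, 5), (1, 5))` yields `(2, 10)`.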
However, I believe that "implementation detail" is responsible for a fair
amount of the pain counters cause. In particular it complicates the
implementation substantially because:
* a local shard on one replica is a remote shard on another replica. We handle
this by transforming shards during deserialization, which is complex and
fragile. It's also the source of CASSANDRA-4071 (and at least one contributor
to CASSANDRA-4417).
* we have to be extremely careful not to duplicate a local shard internally or
we'll over-count. Because the storage engine was initially designed on the
assumption that using the same column twice was harmless, this has led to a
number of bugs.
We could change that "implementation detail": we could stop distinguishing
the merge rules for local shards, and when a replica needs to increment its
shard, it would read/increment/write while holding a lock to ensure
atomicity. This would likely simplify the implementation and fix
CASSANDRA-4071 and CASSANDRA-4417. Of course, this would still not fix the
other top-level problems (not being able to retry, broken remove, ...).
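The lock-based alternative sketched above might look like this. Again a toy
model under the assumptions of the earlier description; `Shard` and `merge`
are invented names:

```python
import threading

class Shard:
    """A replica's own shard, updated via read/increment/write under a lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self._clock = 0
        self._count = 0

    def increment(self, delta):
        # 1) grab the lock, 2) read the current value, 3) increment it,
        # 4) write it back, 5) release the lock.
        with self._lock:
            self._clock += 1
            self._count += delta
            return (self._clock, self._count)

def merge(a, b):
    # With no special "local" rule, every shard merges the same way:
    # keep the version with the biggest clock.
    return a if a[0] >= b[0] else b
```

Since each stored shard now carries the full count rather than a delta,
merging is idempotent, so internally duplicating a shard can no longer cause
over-counting; the price is taking a lock on the local increment path.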
> Counters 2.0
> ------------
>
> Key: CASSANDRA-4775
> URL: https://issues.apache.org/jira/browse/CASSANDRA-4775
> Project: Cassandra
> Issue Type: New Feature
> Components: Core
> Reporter: Arya Goudarzi
> Labels: counters
> Fix For: 2.0
>
>
> The existing partitioned counters remain a source of frustration for most
> users almost two years after being introduced. The remaining problems are
> inherent in the design, not something that can be fixed given enough
> time/eyeballs.
> Ideally a solution would give us
> - similar performance
> - fewer special cases in the code
> - potential for a retry mechanism
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira