[
https://issues.apache.org/jira/browse/CASSANDRA-4775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13722095#comment-13722095
]
Aleksey Yeschenko commented on CASSANDRA-4775:
----------------------------------------------
[~nff], regarding the comments on the gist:
bq. Having "merge cells" means that we could support both TTLs and "put"
operations on counters, as long as the semantics are well defined.
Being able to do "put" writes - true, that would become possible (and trivially so: you just write a new merge cell). I don't see how this design would allow for TTLs in any sane way, though (or even an insane one).
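For context, the shape of the set-based design I'm referring to: a counter is a set of uuid-named delta cells, and a merge cell shadows everything written before it, so a "put" is just a merge cell whose value comes from the client instead of being computed. A rough sketch of that shape (the type names are mine and purely illustrative, not the gist's actual cell layout):
{code:java}
import java.util.UUID;

// Purely illustrative: just the shape of the set-based design, not the gist's actual layout.
interface CounterCell {}

// A regular increment/decrement, identified by a UUID (the "update id").
record DeltaCell(UUID id, long delta) implements CounterCell {}

// A merge cell: carries an accumulated absolute value and shadows every delta
// (and every earlier merge cell) written before it. A client "put" would simply
// be a merge cell whose value is supplied by the client instead of computed.
record MergeCell(UUID id, long value) implements CounterCell {}
{code}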
bq. … or would we always require QUORUM writes? This could be too restrictive
in multi-DC deployments where most people probably prefer to read and write at
LOCAL_QUORUM. … When this set is common to all replicas (as is proposed above),
we can only merge in QUORUM reads if we can guarantee QUORUM writes or must
merge in reads at ALL otherwise.
Yep, this is (too) restrictive, and it's a problem. Both options are too
restrictive - we can't require QUORUM writes (esp. in multi-DC clusters), and
requiring ALL for merges is also unreasonable, multi-DC or not. This alone
makes the (initial) design in the gist unsuitable as-is, IMO.
Now, regarding the first improvement suggestion (sharded replicas):
bq. if we assign a replica deterministically (based on the operation's UUID for
example) we risk not being able to write to the counter if that particular
replica goes down
And this is unacceptable.
bq. if we assign a replica at random, retrying would lead to duplicates
While not a total deal-breaker, this still weakens the argument for the set-based design somewhat (enabling external idempotence/client retries is one of the justifications for the extra overhead).
Also, while set collapse does become a local problem now, what a replica can actually merge/collapse is just the subset of updates it owns, not all of them. The read logic, the cleanup-during-compaction logic, and the collapse logic itself all become more complicated. Despite there being a logical 'owner', the increments (including the merge cells) still need to be replicated (for RF > 1), so each replica now holds increments it 'owns' and increments it doesn't. That means you can no longer read until the first merge cell you meet and stop there, because that merge cell is only responsible for a subset of all writes. You have to keep going until you've read the whole row, or a merge cell for every shard there is, and accumulation itself becomes harder: for each increment you have to track whether you've already read a shadowing merge cell for the subset that increment belongs to, and either add it or ignore it accordingly.
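To make that concrete, here is roughly what read-time accumulation would have to look like under per-replica sharding - purely illustrative types and names, assuming cells arrive newest-first and each cell knows which shard (owner) it belongs to:
{code:java}
import java.util.HashSet;
import java.util.Set;
import java.util.UUID;

// Purely illustrative sketch of read-time accumulation under per-replica sharding.
// Assumes cells are iterated newest-to-oldest and carry their owning shard's id.
final class ShardedCounterReader {
    interface Cell { UUID shard(); }
    record Delta(UUID shard, long delta) implements Cell {}
    record Merge(UUID shard, long value) implements Cell {} // shadows older cells of the same shard only

    static long accumulate(Iterable<Cell> cellsNewestFirst, Set<UUID> allShards) {
        long total = 0;
        Set<UUID> shardsAlreadyMerged = new HashSet<>();
        for (Cell cell : cellsNewestFirst) {
            if (cell instanceof Merge m) {
                // Only the newest merge cell per shard counts; older ones for that shard are shadowed.
                if (shardsAlreadyMerged.add(m.shard()))
                    total += m.value();
            } else if (cell instanceof Delta d) {
                // A delta counts only if no newer merge cell for its shard has been seen yet.
                if (!shardsAlreadyMerged.contains(d.shard()))
                    total += d.delta();
            }
            // Early exit is only possible once *every* shard has produced a merge cell -
            // one merge cell is no longer enough, unlike in the unsharded design.
            if (shardsAlreadyMerged.containsAll(allShards))
                break;
        }
        return total;
    }
}
{code}
Even in this stripped-down form the reader has to know the full shard set up front and keep per-shard shadowing state - the kind of extra bookkeeping described above.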
bq. It becomes possible to read and write at LOCAL_QUORUM using this scheme. As
any particular replica is the only source of truth for the subset of deltas
that it was assigned, it does by definition read ALL of its deltas and can sum
them up with no risk of inconsistency.
You still need the write-path special-casing of the current counters - that is, you have to send the request to just one replica so it can determine the name of the resulting cell (uuid + owner), and that replica then replicates the cell to the other replicas (which kills internal retry safety, by the way). Alternatively, if you pick the owner replica and generate the cell name up front, write at LOCAL_QUORUM, and the write to that owner replica happens to fail, then you can't merge locally even the owner's own subset - you'd need at least LOCAL_QUORUM reads.
So you can't do client retries safely, can't do internal retries safely, have
to keep the special write path, and have reads that are pretty much as complex
as the current implementation, with the added bonus of extra space and time
overhead :(
Regarding DC-level-sharding:
True, you can now do merges at LOCAL_ALL (which we don't have, but that's not
an issue) or at LOCAL_QUORUM, if you mandate writes at LOCAL_QUORUM. (If the
writes are at regular QUORUM or anything other than LOCAL_QUORUM/ALL, then you
still need LOCAL_ALL for the merges, because QUORUM doesn't guarantee
LOCAL_QUORUM).
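To make the arithmetic concrete (a hypothetical two-DC setup with RF=3 in each DC; the numbers are the point, not the code):
{code:java}
// Illustrative arithmetic only: why QUORUM acks don't imply LOCAL_QUORUM acks per DC.
public class QuorumExample {
    public static void main(String[] args) {
        int rfPerDc = 3, dcs = 2;
        int totalReplicas = rfPerDc * dcs;     // 6
        int quorum = totalReplicas / 2 + 1;    // 4
        int localQuorum = rfPerDc / 2 + 1;     // 2

        // A QUORUM write can be acked by all 3 replicas in DC1 and only 1 in DC2.
        int minAcksInDc2 = quorum - rfPerDc;   // 1, which is < localQuorum
        System.out.printf("QUORUM=%d, LOCAL_QUORUM=%d, minimum acks in DC2=%d%n",
                          quorum, localQuorum, minAcksInDc2);
        // Since 1 < 2, a later LOCAL_QUORUM read in DC2 can miss that write entirely,
        // so a strictly DC-local merge there would still need LOCAL_ALL.
    }
}
{code}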
bq. A configuration flag per counter CF would configure whether we require
W.QUORUM+R.QUORUM (default) or let clients write with any CL with the downside
that deltas can only be merged at CL.ALL.
As I wrote above, unless all the writes are at ALL or LOCAL_QUORUM, you can't
merge strictly DC-locally :( So it's actually the choice between
[W.LOCAL_QUORUM writes + R.LOCAL_QUORUM merges] or [W.WHATEVER writes +
R.LOCAL_ALL merges]. And I'm not a fan of either, even though it's definitely
better than requiring ALL.
You also keep the complexity of the sharded-partitions design - you can only merge the subset of the updates that belongs to the local DC, and thus have the same complicated read/compaction logic. Oh, and safe idempotent client retries would require drivers to special-case counter retries somehow - which means they'd have to look inside the query to determine that it's a counter update and not follow the configured retry strategy.
I also suspect that both of these designs are sensitive to topology changes. Oh, and with both you can't do a 'read-repair merge' at most read consistency levels, and when you can, it is now also complicated by the multiple subsets per counter.
bq. I believe that the space and time overheads are about the same as in
Aleksey's design.
Kinda, you now have to encode the owning node (in design #2) or the owning DC
(in design #3) along with the update id, or take away some bytes from it.
Now, there is another issue with the design in the gist and the two suggested improvements. We only have two mechanisms to trigger the collapse/merge: doing it during reads (and only QUORUM reads for design #1, and only at limited levels with #2 and #3) and building candidate lists during compaction. I'm concerned about hot counters that generate tons of updates but are rarely read, or are only read while the updates are still within counter_write_window. Read latency becomes unpredictable with a set-based design; with a particularly hot counter it could even be possible to OOM during a read. Writes would be faster than with the current implementation, true, but the unpredictable read latency bothers me a lot.
I could also write up why Jonathan's original suggestion wouldn't work either - "after gc_grace_seconds, rounded up to the nearest hour" does not magically remove the need to synchronize during the collapse, and implementing a "preliminary merge" CF is far from trivial.
To summarize, I don't think that set-based designs are viable. They can be made either simple but with unreasonably strict limitations, or nearly as complex as the current implementation with limitations that are still too strict - and with unpredictable read latency and extra storage overhead added on top.
We should look into potential improvements to the current implementation
instead:
1. See if we can simplify it by switching to RMW without killing the speed of the current counters (maybe with some intelligent caching, or another optimization nobody has suggested yet);
2. Maybe switch to having two increment-only counters (G-Counters) for each
shard, one for increments and one for decrements, if that proves beneficial.
(If we implement both 1) and 2) then we'd have exactly the PN-Counter
implementation described in the CRDT whitepaper, btw).
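For reference, that combination is the PN-Counter shape from the CRDT whitepaper: two per-replica G-Counter maps, one accumulating increments and one accumulating decrements. A minimal sketch of the shape only (nothing Cassandra-specific; names are illustrative):
{code:java}
import java.util.HashMap;
import java.util.Map;

// Minimal PN-Counter sketch: two grow-only counters (G-Counters), one for
// increments and one for decrements, each keyed by the owning replica/shard.
// Assumes each replica only ever updates its own entry.
final class PNCounter {
    private final Map<String, Long> increments = new HashMap<>(); // replicaId -> sum of its increments
    private final Map<String, Long> decrements = new HashMap<>(); // replicaId -> sum of its decrements

    void increment(String replicaId, long amount) {
        increments.merge(replicaId, amount, Long::sum);
    }

    void decrement(String replicaId, long amount) {
        decrements.merge(replicaId, amount, Long::sum);
    }

    // Current value = total increments - total decrements.
    long value() {
        return sum(increments) - sum(decrements);
    }

    // Reconciling two replicas' states is a per-entry max: commutative, associative
    // and idempotent, which is what makes replica-level merging (repair) safe.
    void mergeWith(PNCounter other) {
        other.increments.forEach((r, v) -> increments.merge(r, v, Long::max));
        other.decrements.forEach((r, v) -> decrements.merge(r, v, Long::max));
    }

    private static long sum(Map<String, Long> m) {
        return m.values().stream().mapToLong(Long::longValue).sum();
    }
}
{code}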
> Counters 2.0
> ------------
>
> Key: CASSANDRA-4775
> URL: https://issues.apache.org/jira/browse/CASSANDRA-4775
> Project: Cassandra
> Issue Type: New Feature
> Components: Core
> Reporter: Arya Goudarzi
> Assignee: Aleksey Yeschenko
> Labels: counters
> Fix For: 2.1
>
>
> The existing partitioned counters remain a source of frustration for most
> users almost two years after being introduced. The remaining problems are
> inherent in the design, not something that can be fixed given enough
> time/eyeballs.
> Ideally a solution would give us
> - similar performance
> - fewer special cases in the code
> - potential for a retry mechanism