[ https://issues.apache.org/jira/browse/CASSANDRA-4775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13722095#comment-13722095 ]

Aleksey Yeschenko commented on CASSANDRA-4775:
----------------------------------------------

[~nff], regarding the comments on the gist:

bq. Having "merge cells" means that we could support both TTLs and "put" 
operations on counters, as long as the semantics are well defined.

Can do "put" writes - true, this would become possible (and trivial - you'd just 
write a new merge cell). I don't see how this design would allow for TTLs in a 
sane way, though (or even an insane one).
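
For illustration, here is a minimal sketch of why "put" falls out naturally 
(a hypothetical cell model, not actual Cassandra internals): a merge cell 
carries an absolute value that shadows all older cells, so a put is just a 
merge cell write, and a read is a newest-first scan that can stop at the 
first merge cell it meets.

{code:java}
// Hypothetical sketch, not Cassandra code: resolve a counter value from
// cells ordered newest-first. A "merge cell" holds an absolute value that
// shadows everything older than it, so a "put" is just a merge cell write.
import java.util.List;

final class CounterCell {
    final long value;
    final boolean isMerge; // true = absolute value shadowing all older cells
    CounterCell(long value, boolean isMerge) {
        this.value = value;
        this.isMerge = isMerge;
    }
}

final class CounterResolver {
    // cellsNewestFirst must be sorted newest-first.
    static long resolve(List<CounterCell> cellsNewestFirst) {
        long sum = 0;
        for (CounterCell c : cellsNewestFirst) {
            sum += c.value;
            if (c.isMerge)
                break; // absolute baseline: everything older is shadowed
        }
        return sum;
    }
}
{code}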

bq. … or would we always require QUORUM writes? This could be too restrictive 
in multi-DC deployments where most people probably prefer to read and write at 
LOCAL_QUORUM. … When this set is common to all replicas (as is proposed above), 
we can only merge in QUORUM reads if we can guarantee QUORUM writes or must 
merge in reads at ALL otherwise.

Yep, this is (too) restrictive, and it's a problem: we can't require QUORUM 
writes (especially in multi-DC clusters), and requiring ALL for merges is just 
as unreasonable, multi-DC or not. This alone makes the (initial) design in the 
gist unsuitable as-is, IMO.

Now, regarding the first improvement suggestion (sharded replicas):

bq. if we assign a replica deterministically (based on the operation's UUID for 
example) we risk not being able to write to the counter if that particular 
replica goes down

And this is unacceptable.

bq. if we assign a replica at random, retrying would lead to duplicates

While not a total deal-breaker, it still weakens the argument for the 
set-based design somewhat (enabling external idempotence/client retries is one 
of the justifications for the extra overhead).

Also, while set collapse now becomes a local problem, what we can actually 
merge/collapse is just a subset of all updates, not all of them. The read 
logic (along with cleanup during compaction, and the collapse logic itself) 
becomes more complicated. Despite there being a logical 'owner', the 
increments (including the merge cells) still need to be replicated (for 
RF > 1), so each replica now holds increments it 'owns' and increments it 
doesn't. That means you can't just read until the first merge cell you meet 
and stop there, because that cell is only responsible for a subset of all 
writes. You have to go on until you've read the whole row, or a merge cell 
for every shard there is. Accumulation itself also becomes harder: for each 
increment you have to track whether you've already read a shadowing merge 
cell for the subset that increment belongs to, and either add it or ignore it 
accordingly.
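
Here's a rough sketch of what that read loop turns into (again hypothetical, 
with an explicit shard id per cell as an assumption): a merge cell now only 
shadows older cells of its own shard, so stopping early is only safe once 
you've seen a merge cell for every shard.

{code:java}
// Hypothetical sketch of the sharded read/accumulation logic: each cell
// belongs to a shard (its 'owner'), and a merge cell only shadows the
// older cells of its own shard.
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class ShardedCell {
    final int shard;       // owning shard id (an assumption of this sketch)
    final long value;
    final boolean isMerge;
    ShardedCell(int shard, long value, boolean isMerge) {
        this.shard = shard;
        this.value = value;
        this.isMerge = isMerge;
    }
}

final class ShardedResolver {
    // cellsNewestFirst sorted newest-first; shardCount known up front.
    static long resolve(List<ShardedCell> cellsNewestFirst, int shardCount) {
        long sum = 0;
        Set<Integer> shadowedShards = new HashSet<>();
        for (ShardedCell c : cellsNewestFirst) {
            if (shadowedShards.contains(c.shard))
                continue; // a newer merge cell for this shard shadows it
            sum += c.value;
            if (c.isMerge) {
                shadowedShards.add(c.shard);
                if (shadowedShards.size() == shardCount)
                    break; // only now is early termination safe
            }
        }
        return sum;
    }
}
{code}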

bq. It becomes possible to read and write at LOCAL_QUORUM using this scheme. As 
any particular replica is the only source of truth for the subset of deltas 
that it was assigned, it does by definition read ALL of its deltas and can sum 
them up with no risk of inconsistency.

You still need the write-path special-casing of the current counters - that 
is, you have to send the request to just one replica, which determines the 
name of the resulting cell (uuid + owner) and then replicates the cell to the 
other replicas (this kills internal retry safety, btw). If instead you 
determine the owner replica first, generate the cell name up front, and write 
it at LOCAL_QUORUM, but the write to that owner replica actually fails... then 
you can't merge locally, not even the owner's subset - you'd need at least 
LOCAL_QUORUM reads.
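
To make the special-casing concrete, here is a hypothetical leader-style 
write path (Replica, writeLocally and replicate are made-up names for this 
sketch, not Cassandra's API):

{code:java}
// Sketch only: the coordinator can't name the cell itself, because the
// name embeds the owner, so the update must be forwarded to exactly one
// replica first, which then replicates it onward.
import java.util.List;
import java.util.UUID;

interface Replica {
    String id();
    void writeLocally(String cellName, long delta);
    void replicate(String cellName, long delta);
}

final class ShardedCounterWrite {
    void apply(long delta, Replica owner, List<Replica> allReplicas) {
        // Only the owner can generate the cell name (uuid + owner).
        String cellName = owner.id() + ":" + UUID.randomUUID();
        owner.writeLocally(cellName, delta);
        // If the coordinator times out and retries the forward, the owner
        // generates a fresh uuid and the same delta lands twice - which is
        // exactly why internal retries aren't safe here.
        for (Replica r : allReplicas)
            if (r != owner)
                r.replicate(cellName, delta);
    }
}
{code}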

So you can't do client retries safely, can't do internal retries safely, have 
to keep the special write path, and have reads that are pretty much as complex 
as the current implementation, with the added bonus of extra space and time 
overhead :(

Regarding DC-level-sharding:

True, you can now do merges at LOCAL_ALL (which we don't have, but that's not 
an issue) or at LOCAL_QUORUM, if you mandate writes at LOCAL_QUORUM. (If the 
writes are at regular QUORUM, or at anything other than LOCAL_QUORUM/ALL, then 
you still need LOCAL_ALL for the merges, because QUORUM doesn't guarantee 
LOCAL_QUORUM - see the worked example below.)
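
A quick worked example of that last point, assuming two DCs with RF=3 each 
(the numbers, not the code, are what matter here):

{code:java}
// Worked example: QUORUM does not imply LOCAL_QUORUM.
public class QuorumMath {
    public static void main(String[] args) {
        int rfPerDc = 3, dcs = 2;
        int total = rfPerDc * dcs;          // 6 replicas in the cluster
        int quorum = total / 2 + 1;         // QUORUM = 4 of 6
        int localQuorum = rfPerDc / 2 + 1;  // LOCAL_QUORUM = 2 of 3 per DC
        // A QUORUM write can be acked by 3 replicas in DC1 and only 1 in
        // DC2, so DC2 holds the write on fewer than a local quorum of its
        // replicas - a DC-local merge there would miss it unless the merge
        // reads at LOCAL_ALL.
        System.out.printf("quorum=%d, localQuorum=%d%n", quorum, localQuorum);
    }
}
{code}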

bq. A configuration flag per counter CF would configure whether we require 
W.QUORUM+R.QUORUM (default) or let clients write with any CL with the downside 
that deltas can only be merged at CL.ALL.

As I wrote above, unless all the writes are at ALL or LOCAL_QUORUM, you can't 
merge strictly DC-locally :( So it's actually a choice between 
[W.LOCAL_QUORUM writes + R.LOCAL_QUORUM merges] and [W.WHATEVER writes + 
R.LOCAL_ALL merges]. I'm not a fan of either, even though both are definitely 
better than requiring ALL.

You also keep the complexity of the sharded-replicas design - you can only 
merge the subset of the updates that belongs to the DC, and thus still have 
the complicated reads/compaction logic. Oh, and safe idempotent client retries 
would require drivers to special-case counter retries somehow - which means 
they'd have to look inside the query to determine that it's a counter update 
and not follow the configured retry strategy.

I also suspect that both of these designs are sensitive to topology changes. 
Oh, and with both you can't do 'read-repair-merge' at most read consistency 
levels, and even when you can, it's now complicated by the multiple subsets 
per counter.

bq. I believe that the space and time overheads are about the same as in 
Aleksey's design.

Kinda - you now also have to encode the owning node (in design #2) or the 
owning DC (in design #3) along with the update id, or take away some bytes 
from the id.
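
Roughly speaking, with a hypothetical layout (just to show where the extra 
bytes go - this is not an actual Cassandra format):

{code:java}
// Hypothetical encoding of the ownership metadata: the cell name has to
// carry the update id plus the owning node (design #2) or DC (design #3).
import java.nio.ByteBuffer;
import java.util.UUID;

final class OwnedCellName {
    static ByteBuffer encode(short ownerId, UUID updateId) {
        return (ByteBuffer) ByteBuffer.allocate(2 + 16)
                .putShort(ownerId)                          // owning node or DC
                .putLong(updateId.getMostSignificantBits()) // 16-byte update id
                .putLong(updateId.getLeastSignificantBits())
                .flip();
    }
}
{code}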

Now, there is another issue with the design in the gist and the two suggested 
improvements. We only have two mechanisms to trigger the collapse/merge - 
doing it during reads (and only QUORUM reads for design #1, and at limited 
levels with #2 and #3), and building candidate lists during compaction. I'm 
concerned about hot counters that generate tons of updates but are rarely 
read, or are read while the updates are still within counter_write_window. 
Read latency becomes unpredictable with a set-based design; with a 
particularly hot counter it could even be possible to OOM during the reads. 
Writes would be faster than with the current implementation, true, but the 
unpredictable read latency bothers me a lot.

I could also write up why Jonathan's original suggestion wouldn't work either 
- "after gc_grace_seconds, rounded up to the nearest hour" does not magically 
remove the need to synchronize during the collapse, and implementing the 
"preliminary merge" CF is far from trivial.

To summarize, I don't think that set-based designs are viable. They can be 
made either simple, but with unreasonably strict limitations, or nearly as 
complex as the current implementation, with limitations that are still too 
strict, plus unpredictable read latency and storage overhead on top.

We should look into potential improvements to the current implementation 
instead: 

1. See if we can simplify it by switching to RMW without killing the speed of 
the current counters (maybe with some intelligent caching, or another 
optimization not yet suggested);
2. Maybe switch to having two increment-only counters (G-Counters) per shard, 
one for increments and one for decrements, if that proves beneficial.

(If we implement both 1) and 2), then we'd have exactly the PN-Counter 
implementation described in the CRDT whitepaper, btw - a rough sketch follows 
below.)
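
For reference, a minimal stand-alone PN-Counter sketch along the lines of the 
whitepaper (hypothetical code, not a proposed Cassandra implementation):

{code:java}
// PN-Counter: two G-Counters with one slot per shard - one for increments,
// one for decrements. Value = sum(increments) - sum(decrements).
import java.util.Arrays;

final class PNCounter {
    private final long[] p; // per-shard increment totals (a G-Counter)
    private final long[] n; // per-shard decrement totals (a G-Counter)
    private final int me;   // this shard's index

    PNCounter(int shards, int me) {
        this.p = new long[shards];
        this.n = new long[shards];
        this.me = me;
    }

    void increment(long by) { p[me] += by; }
    void decrement(long by) { n[me] += by; }

    long value() {
        return Arrays.stream(p).sum() - Arrays.stream(n).sum();
    }

    // Merge is a per-slot max: commutative, associative and idempotent,
    // so replicas can exchange state in any order, any number of times.
    void merge(PNCounter other) {
        for (int i = 0; i < p.length; i++) {
            p[i] = Math.max(p[i], other.p[i]);
            n[i] = Math.max(n[i], other.n[i]);
        }
    }
}
{code}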
                
> Counters 2.0
> ------------
>
>                 Key: CASSANDRA-4775
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4775
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Arya Goudarzi
>            Assignee: Aleksey Yeschenko
>              Labels: counters
>             Fix For: 2.1
>
>
> The existing partitioned counters remain a source of frustration for most 
> users almost two years after being introduced.  The remaining problems are 
> inherent in the design, not something that can be fixed given enough 
> time/eyeballs.
> Ideally a solution would give us
> - similar performance
> - less special cases in the code
> - potential for a retry mechanism
