[
https://issues.apache.org/jira/browse/CASSANDRA-1072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12929750#action_12929750
]
Sylvain Lebresne edited comment on CASSANDRA-1072 at 11/8/10 4:43 PM:
----------------------------------------------------------------------
I think this patch has a number of important shortcomings (all of which have
been discussed in some comments of CASSANDRA-1546, so sorry for the repetition):
# the patch uses IP addresses as node identifiers for the partitions of the
counters (see the first sketch after this list). This is overly fragile (a
change of IP, accidental or not, could corrupt data) and, I'm growing more and
more convinced, a bad idea. An obvious solution is to use uuids instead of IPs.
However, from that perspective, I believe the approach taken by CASSANDRA-1546
to be a lot simpler (but I could be biased) than the clean context logic of
this patch, because the clean context logic requires global knowledge of the
node uuid assignments, while the approach of CASSANDRA-1546 does not.
# cluster topology changes could result in data corruption if the user does not
take proper care. Consider a simple cluster with a single node A (RF=1),
accepting updates on a counter c. We bootstrap node B, which gets counter c in
its range (it is thus streamed to B). Now let's say that node B is
decommissioned. Counter c will be streamed back to A "as is". If repair has
been run on A after B was bootstrapped, this is fine. But if repair wasn't run,
it'll result in an (on-disk) corrupted counter, because the newly streamed
parts will be merged with the old version (see the second sketch after this
list). And I don't think that requiring users to run repair at the risk of
losing data is the right fix. This is not unrelated to my previous point, in
that I believe that with uuids we can fix this by renewing a given node ID on
range changes. Again, the approach of CASSANDRA-1546, where we don't need to
know the assignment of node ID -> actual node (at least not on the read and/or
write path), makes that much easier.
# there is a race condition during reads: a given row can be read twice,
because the switch from the current memtable to the memtable pending flush is
not atomic. The same is true when a flushed memtable becomes an sstable, and at
the end of compaction. This is fine for normal reads, but will result in bogus
reads for counters (see the third sketch after this list). The patch attached
to CASSANDRA-1546 proposes a fix for that.
# there is no replication on writes, which is worse than merely not supporting
CL.QUORUM: this patch does not provide any reasonable durability guarantee. And
imho, this is far too important to be simply left as a 'later improvement'.
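To make the first point concrete, here is a minimal sketch (hypothetical code,
not the patch's actual counter context; the class and method names are made up
for illustration) of a counter whose value is partitioned per node, and of why
keying those partitions by IP is fragile:

{code:java}
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration, not the patch's code: a counter whose total is
// the sum of per-node partitions, keyed by some node identifier.
class PartitionedCounter {
    // node identifier -> count contributed by that node
    private final Map<String, Long> partitions = new HashMap<>();

    // Each node is supposed to only ever write its own partition.
    void setOwnPartition(String nodeId, long count) {
        partitions.put(nodeId, count);
    }

    long total() {
        return partitions.values().stream().mapToLong(Long::longValue).sum();
    }

    public static void main(String[] args) {
        PartitionedCounter c = new PartitionedCounter();
        c.setOwnPartition("10.0.0.1", 5);  // node A, identified by its IP

        // A's IP changes, or another node ends up with 10.0.0.1: the old
        // partition no longer belongs to anyone in particular, and whoever
        // writes under 10.0.0.1 next silently overwrites A's contribution.
        c.setOwnPartition("10.0.0.1", 2);

        System.out.println(c.total());     // 2 instead of the correct 7
    }
}
{code}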
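For the second point, a minimal sketch of the corruption scenario; it assumes,
for illustration only, that the streamed-in copy of A's own partition ends up
merged with the stale local version by simple addition:

{code:java}
// Hypothetical illustration of the streaming scenario described above.
class StreamMergeHazard {
    public static void main(String[] args) {
        long localOnA = 10;                 // A's own partition before B bootstraps

        long streamedToB = localOnA;        // counter c streamed to B "as is"
        // ... node B is then decommissioned ...
        long streamedBackToA = streamedToB; // c streamed back to A, still carrying A's partition

        // Repair was not run, so A still holds its old version locally and the
        // newly streamed parts get merged with it as if they were new deltas:
        long mergedOnA = localOnA + streamedBackToA;

        System.out.println(mergedOnA);      // 20 -- corrupted on disk (the real value is 10)
    }
}
{code}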
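And for the read race, a small sketch (made-up numbers) of why reading the same
data twice is harmless for a normal column but not for a counter:

{code:java}
// Hypothetical illustration of the read race during a memtable flush: for a
// short window the same increment can be visible both in the memtable pending
// flush and in the freshly written sstable.
class ReadRaceSketch {
    public static void main(String[] args) {
        long deltaInFlushingMemtable = 4;   // increment still visible in the flushing memtable
        long deltaInNewSstable       = 4;   // the very same increment, already on disk

        // Normal column read: the duplicate reconciles away (same value, same timestamp).
        long normalRead = Math.max(deltaInFlushingMemtable, deltaInNewSstable);  // 4

        // Counter read: deltas are summed, so the duplicate is counted twice.
        long counterRead = deltaInFlushingMemtable + deltaInNewSstable;          // 8

        System.out.println(normalRead + " vs " + counterRead);                   // 4 vs 8
    }
}
{code}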
> Increment counters
> ------------------
>
> Key: CASSANDRA-1072
> URL: https://issues.apache.org/jira/browse/CASSANDRA-1072
> Project: Cassandra
> Issue Type: Sub-task
> Components: Core
> Reporter: Johan Oskarsson
> Assignee: Kelvin Kakugawa
> Attachments: CASSANDRA-1072.patch, Incrementcountersdesigndoc.pdf
>
>
> Break the increment counters out of CASSANDRA-580. Classes are shared
> between the two features, but without the plain version vector code the
> changeset becomes smaller and more manageable.