[
https://issues.apache.org/jira/browse/CASSANDRA-1072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12929750#action_12929750
]
Sylvain Lebresne edited comment on CASSANDRA-1072 at 11/8/10 4:43 PM:
----------------------------------------------------------------------
I think this patch has a number of important shortcomings (all of which have
been discussed in some comments of CASSANDRA-1546, so sorry for the repetition):
# the patch uses IP addresses as node identifiers for the partitions of the
counters (see the first sketch after this list). This is overly fragile (a
change of IP, accidental or not, could corrupt data) and, I'm growing more and
more convinced, a bad idea. An obvious solution is to use uuids instead of IPs.
However, from that perspective, I believe the approach taken by CASSANDRA-1546
to be a lot simpler (but I could be biased) than the clean context logic of
this patch, because the clean context logic requires global knowledge of the
node uuid assignments, while the approach of CASSANDRA-1546 does not.
# cluster topology changes could result in data corruption if the user does not
take proper care. Consider a simple cluster with a single node A (RF=1),
accepting updates on a counter c. We bootstrap node B, which gets counter c in
its range (it is thus streamed to B). Now let's say that node B is
decommissioned. Counter c will be streamed back to A "as is". If repair has
been run on A after B was bootstrapped, this is fine. But if repair wasn't run,
it'll result in an (on-disk) corrupted counter, because the newly streamed
parts will be merged with the old version (see the second sketch after this
list). And I don't think that requiring users to run repair at the risk of
losing data is the right fix. This is not unrelated to my previous point, in
that I believe that with uuids we can fix this by renewing a given node ID on
range changes. Again, the approach of CASSANDRA-1546, where we don't need to
know the assignment of node ID -> actual node (at least not on the read and/or
write path), makes that much easier.
# there is a race condition during reads: a given row can be read twice,
because the switch from the current memtable to the memtable pending flush is
not atomic. The same is true when a flushed memtable becomes an sstable, and at
the end of compaction. This is fine for normal reads, but will result in bogus
reads for counters (see the third sketch after this list). The patch attached
to CASSANDRA-1546 proposes a fix for that.
# there is no replication on writes, which is worse than merely not supporting
CL.QUORUM: this patch does not provide any reasonable durability guarantee. And
imho, this is far too important to be simply left as a 'later improvement'.
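To make the first point concrete, here is a minimal sketch (hypothetical code,
not the patch's actual counter context; the class and method names are made up
for illustration) of a counter whose value is partitioned per node, and of why
keying those partitions by IP is fragile:

{code:java}
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration, not the patch's code: a counter whose total is
// the sum of per-node partitions, keyed by some node identifier.
class PartitionedCounter {
    // node identifier -> count contributed by that node
    private final Map<String, Long> partitions = new HashMap<>();

    // Each node is supposed to only ever write its own partition.
    void setOwnPartition(String nodeId, long count) {
        partitions.put(nodeId, count);
    }

    long total() {
        return partitions.values().stream().mapToLong(Long::longValue).sum();
    }

    public static void main(String[] args) {
        PartitionedCounter c = new PartitionedCounter();
        c.setOwnPartition("10.0.0.1", 5);  // node A, identified by its IP

        // A's IP changes, or another node ends up with 10.0.0.1: the old
        // partition no longer belongs to anyone in particular, and whoever
        // writes under 10.0.0.1 next silently overwrites A's contribution.
        c.setOwnPartition("10.0.0.1", 2);

        System.out.println(c.total());     // 2 instead of the correct 7
    }
}
{code}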
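For the second point, a minimal sketch of the corruption scenario; it assumes,
for illustration only, that the streamed-in copy of A's own partition ends up
merged with the stale local version by simple addition:

{code:java}
// Hypothetical illustration of the streaming scenario described above.
class StreamMergeHazard {
    public static void main(String[] args) {
        long localOnA = 10;                 // A's own partition before B bootstraps

        long streamedToB = localOnA;        // counter c streamed to B "as is"
        // ... node B is then decommissioned ...
        long streamedBackToA = streamedToB; // c streamed back to A, still carrying A's partition

        // Repair was not run, so A still holds its old version locally and the
        // newly streamed parts get merged with it as if they were new deltas:
        long mergedOnA = localOnA + streamedBackToA;

        System.out.println(mergedOnA);      // 20 -- corrupted on disk (the real value is 10)
    }
}
{code}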
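And for the read race, a small sketch (made-up numbers) of why reading the same
data twice is harmless for a normal column but not for a counter:

{code:java}
// Hypothetical illustration of the read race during a memtable flush: for a
// short window the same increment can be visible both in the memtable pending
// flush and in the freshly written sstable.
class ReadRaceSketch {
    public static void main(String[] args) {
        long deltaInFlushingMemtable = 4;   // increment still visible in the flushing memtable
        long deltaInNewSstable       = 4;   // the very same increment, already on disk

        // Normal column read: the duplicate reconciles away (same value, same timestamp).
        long normalRead = Math.max(deltaInFlushingMemtable, deltaInNewSstable);  // 4

        // Counter read: deltas are summed, so the duplicate is counted twice.
        long counterRead = deltaInFlushingMemtable + deltaInNewSstable;          // 8

        System.out.println(normalRead + " vs " + counterRead);                   // 4 vs 8
    }
}
{code}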
> Increment counters
> ------------------
>
> Key: CASSANDRA-1072
> URL: https://issues.apache.org/jira/browse/CASSANDRA-1072
> Project: Cassandra
> Issue Type: Sub-task
> Components: Core
> Reporter: Johan Oskarsson
> Assignee: Kelvin Kakugawa
> Attachments: CASSANDRA-1072.patch, Incrementcountersdesigndoc.pdf
>
>
> Break the increment counters out of CASSANDRA-580. Classes are shared
> between the two features, but without the plain version vector code the
> changeset becomes smaller and more manageable.