[
https://issues.apache.org/jira/browse/CASSANDRA-11349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Fabien Rousseau updated CASSANDRA-11349:
----------------------------------------
Description:
We observed that repair, for some of our clusters, streamed a lot of data and
many partitions were "out of sync".
Moreover, the read repair mismatch ratio is around 3% on those clusters, which
is really high.
After investigation, it appears that if two range tombstones exist for a
partition for the same range/interval, they are both included in the Merkle tree
computation.
But if, for some reason, the two range tombstones were already compacted into a
single range tombstone on another node, this results in a Merkle tree
difference.
This is clearly bad because Merkle tree differences then depend on compaction
state: if a partition is deleted and re-created multiple times, the only way to
ensure that repair works correctly and does not overstream data is to run a
major compaction before each repair, which is not really feasible.
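To make the mechanism concrete, here is a toy Python sketch. It is not
Cassandra's validation code: the `Tombstone` tuple, `digest` and
`merge_same_interval` are hypothetical names, and SHA-256 is an arbitrary choice
of hash. It only shows that hashing tombstones as stored makes the digest depend
on compaction state, whereas collapsing tombstones that cover the same interval
before hashing does not.

```python
import hashlib
from typing import Dict, Iterable, List, Tuple

# Hypothetical model of a range tombstone: (start, end, deletion_timestamp).
Tombstone = Tuple[str, str, int]


def digest(tombstones: Iterable[Tombstone]) -> str:
    """Hash every tombstone exactly as it is stored on the replica."""
    h = hashlib.sha256()
    for start, end, ts in tombstones:
        h.update(f"{start}|{end}|{ts};".encode())
    return h.hexdigest()


def merge_same_interval(tombstones: Iterable[Tombstone]) -> List[Tombstone]:
    """Keep only the newest tombstone for each identical (start, end) interval."""
    newest: Dict[Tuple[str, str], int] = {}
    for start, end, ts in tombstones:
        key = (start, end)
        if key not in newest or ts > newest[key]:
            newest[key] = ts
    return sorted((s, e, ts) for (s, e), ts in newest.items())


# One replica still holds two tombstones for the same interval (not compacted).
node_a = [("b", "b", 100), ("b", "b", 200)]
# Another replica has already collapsed them into the newest one.
node_b = [("b", "b", 200)]

# Hashing the raw tombstones: the digests differ although the data is equivalent.
naive_mismatch = digest(node_a) != digest(node_b)

# Hashing after normalization: the digests agree.
normalized_match = (
    digest(merge_same_interval(node_a)) == digest(merge_same_interval(node_b))
)
```

Normalizing before hashing is just one conceivable way to make the digest
independent of compaction; it is shown here as an assumption, not as the fix
applied to this issue.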
The following steps reproduce the issue:
{noformat}
ccm create test -v 2.1.13 -n 2 -s
ccm node1 cqlsh
CREATE KEYSPACE test_rt WITH replication = {'class': 'SimpleStrategy',
'replication_factor': 2};
USE test_rt;
CREATE TABLE IF NOT EXISTS table1 (
c1 text,
c2 text,
c3 float,
c4 float,
PRIMARY KEY ((c1), c2)
);
INSERT INTO table1 (c1, c2, c3, c4) VALUES ( 'a', 'b', 1, 2);
DELETE FROM table1 WHERE c1 = 'a' AND c2 = 'b';
ctrl ^d
# now flush only one of the two nodes
ccm node1 flush
ccm node1 cqlsh
USE test_rt;
INSERT INTO table1 (c1, c2, c3, c4) VALUES ( 'a', 'b', 1, 3);
DELETE FROM table1 WHERE c1 = 'a' AND c2 = 'b';
ctrl ^d
ccm node1 repair
# now grep the log and observe that some inconsistencies were detected
# between nodes (while none should have been detected)
ccm node1 showlog | grep "out of sync"
{noformat}
As a consequence, repairs are costly and produce many small SSTables (up to
thousands within a rather short period when using vnodes, until compaction
absorbs those small files), and disk usage increases as well.
> MerkleTree mismatch when multiple range tombstones exists for the same
> partition and interval
> ---------------------------------------------------------------------------------------------
>
> Key: CASSANDRA-11349
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11349
> Project: Cassandra
> Issue Type: Bug
> Reporter: Fabien Rousseau
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)