[
https://issues.apache.org/jira/browse/CASSANDRA-11349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15254670#comment-15254670
]
Fabien Rousseau commented on CASSANDRA-11349:
---------------------------------------------
Sorry for not being responsive lately; I'm rather busy at the moment...
I'd be more than happy[1] to see this patch in the next release.
I haven't tested it yet, but I can probably find some time next week to test it
on a dev cluster if that helps.
Nevertheless, I won't be able to tell whether it really worked, because there
will still be some mismatches (due to CASSANDRA-11477).
I have started working on a patch which should be able to handle both
CASSANDRA-11477 and the last edge case.
What it basically does:
- Tracker is now an interface
- there are two implementations: one called RegularCompactionTracker and the
other ValidationCompactionTracker
- the ColumnIndexer.Builder has one more optional parameter: a boolean
indicating whether it is built for validation
- the RegularCompactionTracker is identical to the existing Tracker, plus one
empty method
- the ValidationCompactionTracker is similar to the existing Tracker but
retains only open tombstones (most methods are thus empty)
- the Reducer changed slightly, but its behaviour regarding regular
compactions is unchanged
I can share it if you're interested (the code compiles, but I haven't tested
it at all yet; I plan to do that soon and share it afterwards).
[1] Just to share more information: these issues are important to us because a
few of our clusters are impacted. A few days after filing the bug, we decided
to temporarily stop repairing some of the tables that were heavily impacted by
these bugs (knowing that we could live with inconsistencies on those tables),
since each repair increased disk occupancy by a few percent, and we ran a major
compaction instead. This reduced disk occupancy by a factor of two to three
(one table shrank from 243GB to 79GB). Note that this was not due to tombstones
reclaiming old data: it has been nearly a month now, the big SSTable resulting
from the major compaction is still there, and disk usage has not grown much
since.
> MerkleTree mismatch when multiple range tombstones exists for the same
> partition and interval
> ---------------------------------------------------------------------------------------------
>
> Key: CASSANDRA-11349
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11349
> Project: Cassandra
> Issue Type: Bug
> Reporter: Fabien Rousseau
> Assignee: Stefan Podkowinski
> Labels: repair
> Fix For: 2.1.x, 2.2.x
>
> Attachments: 11349-2.1-v2.patch, 11349-2.1.patch
>
>
> We observed that repair, for some of our clusters, streamed a lot of data and
> many partitions were "out of sync".
> Moreover, the read repair mismatch ratio is around 3% on those clusters,
> which is really high.
> After investigation, it appears that, if two range tombstones exist for a
> partition for the same range/interval, they are both included in the merkle
> tree computation.
> But if, for some reason, the two range tombstones were already compacted
> into a single range tombstone on another node, this results in a merkle
> tree difference.
> Currently, this is clearly bad because MerkleTree differences depend on
> compaction state (and if a partition is deleted and re-created multiple
> times, the only way to ensure that repair "works correctly"/"doesn't
> overstream data" is to run a major compaction before each repair... which is
> not really feasible).
> Below are the steps to easily reproduce this case:
> {noformat}
> ccm create test -v 2.1.13 -n 2 -s
> ccm node1 cqlsh
> CREATE KEYSPACE test_rt WITH replication = {'class': 'SimpleStrategy',
> 'replication_factor': 2};
> USE test_rt;
> CREATE TABLE IF NOT EXISTS table1 (
> c1 text,
> c2 text,
> c3 float,
> c4 float,
> PRIMARY KEY ((c1), c2)
> );
> INSERT INTO table1 (c1, c2, c3, c4) VALUES ( 'a', 'b', 1, 2);
> DELETE FROM table1 WHERE c1 = 'a' AND c2 = 'b';
> ctrl ^d
> # now flush only one of the two nodes
> ccm node1 flush
> ccm node1 cqlsh
> USE test_rt;
> INSERT INTO table1 (c1, c2, c3, c4) VALUES ( 'a', 'b', 1, 3);
> DELETE FROM table1 WHERE c1 = 'a' AND c2 = 'b';
> ctrl ^d
> ccm node1 repair
> # now grep the log and observe that some inconsistencies were detected
> # between nodes (while none should have been detected)
> ccm node1 showlog | grep "out of sync"
> {noformat}
> The consequences are a costly repair and the accumulation of many small
> SSTables (up to thousands over a rather short period when using vnodes,
> until compaction absorbs those small files), as well as an increased size
> on disk.
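The mismatch described in the quoted report can be sketched with a toy digest
(hypothetical; this is not Cassandra's actual MerkleTree hashing, just an
illustration of why hashing the raw tombstone stream is compaction-dependent):

```java
import java.util.Arrays;
import java.util.List;

// Toy digest over a partition's serialized range tombstones (hypothetical;
// Cassandra's real MerkleTree hashing differs). The point: two identical,
// not-yet-compacted tombstones hash differently than the single merged one.
public class DigestMismatch {
    static int digest(List<String> tombstones) {
        int h = 17;
        for (String t : tombstones) h = 31 * h + t.hashCode();
        return h;
    }

    public static void main(String[] args) {
        List<String> nodeA = Arrays.asList("RT[b,b]", "RT[b,b]"); // two flushes, not compacted
        List<String> nodeB = Arrays.asList("RT[b,b]");            // already compacted together
        // Semantically equal data, yet the digests differ,
        // so repair reports the partition as "out of sync"
        System.out.println(digest(nodeA) == digest(nodeB));
    }
}
```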
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)