[ https://issues.apache.org/jira/browse/CASSANDRA-11349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15222902#comment-15222902 ]

Stefan Podkowinski commented on CASSANDRA-11349:
------------------------------------------------

It makes sense just to modify {{onDiskAtomComparator}}. Given the generic name, 
I assumed the comparator was used in other places as well, but since it's only 
used in {{LazyCompactedRow}}, we can change the patch as suggested and simply 
remove the timestamp tie-break behaviour in {{onDiskAtomComparator}}. 
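
To make this concrete, here is a minimal, self-contained sketch of what dropping 
the tie-break changes. The types and names below are illustrative stand-ins, not 
Cassandra's actual classes: without the timestamp tie-break, two atoms at the same 
position compare as equal, so the reducer sees them together and can merge them.

{noformat}
// Illustrative sketch only -- hypothetical stand-in types, not
// Cassandra's actual OnDiskAtom/comparator classes.
import java.util.Comparator;

public class AtomComparatorSketch {
    // Hypothetical on-disk atom: a position (name) plus a timestamp.
    static final class Atom {
        final String name;
        final long timestamp;
        Atom(String name, long timestamp) { this.name = name; this.timestamp = timestamp; }
    }

    // Before the change: ties on the position were broken by timestamp,
    // so two RTs covering the same interval never compared as equal and
    // were both fed into the digest.
    static final Comparator<Atom> WITH_TIE_BREAK =
            Comparator.comparing((Atom a) -> a.name).thenComparingLong(a -> a.timestamp);

    // After the change: compare by position only; equal atoms reach the
    // reducer together and can be merged into one.
    static final Comparator<Atom> POSITION_ONLY = Comparator.comparing((Atom a) -> a.name);

    public static void main(String[] args) {
        Atom rt1 = new Atom("a:b", 1000L);
        Atom rt2 = new Atom("a:b", 2000L);
        System.out.println(WITH_TIE_BREAK.compare(rt1, rt2)); // negative: kept apart
        System.out.println(POSITION_ONLY.compare(rt1, rt2));  // 0: merged by the reducer
    }
}
{noformat}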

As for regular compactions, I agree with Tyler that this should not affect them 
the way it affects validation compaction. Before the patch, {{LazyCompactedRow}} 
would not reduce both RTs, but would instead have 
{{ColumnIndex.buildForCompaction()}} iterate over both RTs and add them to the 
{{RangeTombstone.Tracker}}. The tracker would merge them the same way 
{{LCR.Reducer.getReduced}} does after the patch. However, I'm not entirely sure 
whether there could be other, more complex cases where this would still cause 
problems.
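
For illustration, a rough sketch of the reduce step I'd expect after the patch 
(made-up names, not the real {{LCR.Reducer}} internals): a run of RTs that now 
compare as equal collapses into a single tombstone carrying the newest deletion 
time, which matches what the tracker already produced for regular compactions.

{noformat}
// Illustrative sketch with hypothetical names; not Cassandra's API.
import java.util.List;

public class ReduceSketch {
    // Hypothetical range tombstone: an interval plus its deletion timestamp.
    record RT(String start, String end, long markedForDeleteAt) {}

    // Reduce a run of RTs that compare as equal (same interval) to the
    // one with the highest deletion time.
    static RT reduce(List<RT> sameInterval) {
        RT merged = sameInterval.get(0);
        for (RT rt : sameInterval)
            if (rt.markedForDeleteAt() > merged.markedForDeleteAt())
                merged = rt;
        return merged;
    }

    public static void main(String[] args) {
        RT older = new RT("b", "b", 1000L);
        RT newer = new RT("b", "b", 2000L);
        // Both nodes now digest the same single RT, regardless of whether
        // compaction already merged the two on one of them.
        System.out.println(reduce(List.of(older, newer))); // keeps t=2000
    }
}
{noformat}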

Although the patch should fix the described issue, the way we deal with RTs 
during validation compaction is still not ideal. The problem is that LCR lacks 
some of the handling of relationships between RTs that 
{{RangeTombstone.Tracker}} provides. If we create digests column by column, we 
get wrong results for shadowing tombstones that do not share the same interval:

{noformat}
CREATE KEYSPACE test_rt WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 2};
USE test_rt;
CREATE TABLE IF NOT EXISTS table1 (
    c1 text,
    c2 text,
    c3 text,
    c4 float,
    PRIMARY KEY (c1, c2, c3)
) WITH compaction = {'class': 'SizeTieredCompactionStrategy', 'enabled': 'false'};
DELETE FROM table1 WHERE c1 = 'a' AND c2 = 'b' AND c3 = 'c';

ccm node1 flush

DELETE FROM table1 WHERE c1 = 'a' AND c2 = 'b';

ccm node1 repair test_rt table1
{noformat}


In this case the (c1, c2, c3) RT will always be repaired again once it has been 
compacted together with the shadowing (c1, c2) RT on any node. 
So I'm wondering whether we shouldn't take a bolder approach here than the 
patch does. 
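
To make the mismatch mechanics explicit, here is a small sketch (illustrative 
strings rather than the actual serialized atoms, and MD5 merely as an example 
digest): hashing atom by atom diverges as soon as one node has compacted the 
shadowed (c1, c2, c3) RT away while another still carries both.

{noformat}
// Illustrative sketch only; the atom strings below are made up and do
// not reflect Cassandra's on-disk encoding.
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Arrays;
import java.util.List;

public class DigestSketch {
    static byte[] digest(List<String> atoms) throws Exception {
        MessageDigest md = MessageDigest.getInstance("MD5");
        for (String atom : atoms)                       // column-by-column digest
            md.update(atom.getBytes(StandardCharsets.UTF_8));
        return md.digest();
    }

    public static void main(String[] args) throws Exception {
        // Node that has not compacted yet: the shadowed RT is still present.
        List<String> uncompacted = List.of("RT[b:c..b:c]@t1", "RT[b..b]@t2");
        // Node that has compacted: (c1, c2) shadows (c1, c2, c3), which is dropped.
        List<String> compacted = List.of("RT[b..b]@t2");
        // Different digests although both nodes cover exactly the same deletes.
        System.out.println(Arrays.equals(digest(uncompacted), digest(compacted))); // false
    }
}
{noformat}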


> MerkleTree mismatch when multiple range tombstones exists for the same 
> partition and interval
> ---------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-11349
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-11349
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Fabien Rousseau
>            Assignee: Stefan Podkowinski
>              Labels: repair
>             Fix For: 2.1.x, 2.2.x
>
>         Attachments: 11349-2.1.patch
>
>
> We observed that repair, for some of our clusters, streamed a lot of data and 
> many partitions were "out of sync".
> Moreover, the read repair mismatch ratio is around 3% on those clusters, 
> which is really high.
> After investigation, it appears that, if two range tombstones exist for a 
> partition for the same range/interval, they're both included in the Merkle 
> tree computation.
> But if, for some reason, the two range tombstones were already compacted 
> into a single range tombstone on another node, this will result in a Merkle 
> tree difference.
> Currently, this is clearly bad because MerkleTree differences are dependent 
> on compactions (and if a partition is deleted and created multiple times, the 
> only way to ensure that repair "works correctly"/"doesn't overstream data" is 
> to major-compact before each repair... which is not really feasible).
> Below is a list of steps to easily reproduce this case:
> {noformat}
> ccm create test -v 2.1.13 -n 2 -s
> ccm node1 cqlsh
> CREATE KEYSPACE test_rt WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 2};
> USE test_rt;
> CREATE TABLE IF NOT EXISTS table1 (
>     c1 text,
>     c2 text,
>     c3 float,
>     c4 float,
>     PRIMARY KEY ((c1), c2)
> );
> INSERT INTO table1 (c1, c2, c3, c4) VALUES ( 'a', 'b', 1, 2);
> DELETE FROM table1 WHERE c1 = 'a' AND c2 = 'b';
> ctrl ^d
> # now flush only one of the two nodes
> ccm node1 flush 
> ccm node1 cqlsh
> USE test_rt;
> INSERT INTO table1 (c1, c2, c3, c4) VALUES ( 'a', 'b', 1, 3);
> DELETE FROM table1 WHERE c1 = 'a' AND c2 = 'b';
> ctrl ^d
> ccm node1 repair
> # now grep the log and observe that some inconsistencies were detected 
> # between the nodes (while none should have been detected)
> ccm node1 showlog | grep "out of sync"
> {noformat}
> Consequences of this are a costly repair and the accumulation of many small 
> SSTables (up to thousands over a rather short period of time when using 
> vnodes, until compaction absorbs those small files), but also an increased 
> size on disk.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
