[ https://issues.apache.org/jira/browse/CASSANDRA-11349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15208345#comment-15208345 ]

Stefan Podkowinski commented on CASSANDRA-11349:
------------------------------------------------



I gave the patch some more thought and I'm now confident that the proposed 
change is the best way to address the issue. 

Basically, what happens during validation compaction is that a scanner is 
created for each sstable. The {{CompactionIterable.Reducer}} will then create a 
{{LazilyCompactedRow}} with an iterable of {{OnDiskAtom}}s for the same key from 
each sstable. The purpose of {{LazilyCompactedRow}} during validation 
compaction is to create a digest of the compacted version of all atoms that 
would represent a single row. This is done cell by cell, where each collection 
of atoms for a single cell name is consumed by {{LazilyCompactedRow.Reducer}}.
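
To make that flow a bit more tangible, here's a deliberately simplified sketch of 
what the digesting boils down to. All types in it ({{Atom}}, 
{{ValidationDigestSketch}}) are hypothetical toy stand-ins, not the actual 
{{OnDiskAtom}}/{{LazilyCompactedRow}} code path, so treat it as an illustration of 
the idea only:

{code:java}
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.*;

// Hypothetical, simplified model of validation digesting. The real code lives in
// LazilyCompactedRow and works on OnDiskAtom/MergeIterator, not on these toy types.
public class ValidationDigestSketch
{
    // toy stand-in for an OnDiskAtom: cell name, value and write timestamp
    record Atom(String name, String value, long timestamp) {}

    public static byte[] digestRow(List<Iterator<Atom>> perSstableAtoms) throws Exception
    {
        MessageDigest digest = MessageDigest.getInstance("MD5");

        // collect the atoms of one partition from every sstable and group them by
        // cell name, mimicking what MergeIterator.ManyToOne + the Reducer do lazily
        SortedMap<String, List<Atom>> byName = new TreeMap<>();
        for (Iterator<Atom> atoms : perSstableAtoms)
        {
            while (atoms.hasNext())
            {
                Atom a = atoms.next();
                byName.computeIfAbsent(a.name(), k -> new ArrayList<>()).add(a);
            }
        }

        // digest the merged ("compacted") version of each cell exactly once
        for (List<Atom> cell : byName.values())
        {
            Atom merged = cell.stream()
                              .max(Comparator.comparingLong(Atom::timestamp))
                              .get(); // toy reconciliation: newest atom wins
            digest.update(merged.name().getBytes(StandardCharsets.UTF_8));
            digest.update(merged.value().getBytes(StandardCharsets.UTF_8));
        }
        return digest.digest();
    }

    public static void main(String[] args) throws Exception
    {
        List<Iterator<Atom>> sstables = List.of(
            List.of(new Atom("c3", "1.0", 10L)).iterator(),
            List.of(new Atom("c3", "1.0", 20L), new Atom("c4", "2.0", 20L)).iterator());
        System.out.println(Arrays.toString(digestRow(sstables)));
    }
}
{code}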
 
The decision whether {{LazilyCompactedRow.Reducer}} should finish merging 
a cell and move on to the next one is currently made by 
{{AbstractCellNameType.onDiskAtomComparator}}, as evaluated by 
{{MergeIterator.ManyToOne}}. However, the comparator does not only compare by 
name, but also by {{DeletionTime}} in case of {{RangeTombstone}}s. As a 
consequence, {{MergeIterator.ManyToOne}} will advance whenever two 
{{RangeTombstone}}s with different deletion times are read, which breaks the 
"_will be called one or more times with cells that share the same column name_" 
contract of {{LazilyCompactedRow.Reducer}}.
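
To illustrate why this breaks the grouping, here's a small self-contained toy 
example (again with hypothetical types rather than the real {{RangeTombstone}} 
and {{onDiskAtomComparator}}): as soon as the comparator also looks at the 
deletion time, two tombstones covering the same interval no longer compare as 
equal, so a merge that groups by {{compare() == 0}} will hand them to the 
reducer as two separate groups:

{code:java}
import java.util.Comparator;

// Hypothetical toy model: not the real RangeTombstone/onDiskAtomComparator,
// just enough to show why comparing deletion times breaks per-name grouping.
public class TombstoneGroupingSketch
{
    record Tombstone(String start, String end, long markedForDeleteAt) {}

    // mimics the relevant behaviour of AbstractCellNameType.onDiskAtomComparator:
    // compare the interval first, then fall through to the deletion time
    static final Comparator<Tombstone> LIKE_ON_DISK_ATOM_COMPARATOR =
        Comparator.comparing(Tombstone::start)
                  .thenComparing(Tombstone::end)
                  .thenComparingLong(Tombstone::markedForDeleteAt);

    public static void main(String[] args)
    {
        Tombstone a = new Tombstone("b", "b", 1000L);
        Tombstone b = new Tombstone("b", "b", 2000L); // same interval, later deletion

        // non-zero result: MergeIterator-style grouping by compare() == 0 would
        // hand these to the reducer as two separate "cells", hence two digest updates
        System.out.println(LIKE_ON_DISK_ATOM_COMPARATOR.compare(a, b)); // prints a negative number
    }
}
{code}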

The submitted patch introduces a new {{Comparator<OnDiskAtom>}} that basically 
works like {{onDiskAtomComparator}}, but does not compare deletion times. As 
simple as that.
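
For illustration, a minimal sketch of that idea on the same kind of toy type as 
in the previous example; the actual patch of course works on {{OnDiskAtom}} as 
described above:

{code:java}
import java.util.Comparator;

// Hypothetical sketch of the patched ordering on the toy type: compare only by
// the interval and deliberately ignore the deletion time, so identical intervals
// with different timestamps fall into one reducer group before digesting.
public class NameOnlyComparatorSketch
{
    record Tombstone(String start, String end, long markedForDeleteAt) {}

    static final Comparator<Tombstone> NAME_ONLY =
        Comparator.comparing(Tombstone::start)
                  .thenComparing(Tombstone::end);

    public static void main(String[] args)
    {
        Tombstone a = new Tombstone("b", "b", 1000L);
        Tombstone b = new Tombstone("b", "b", 2000L);
        // prints 0: both tombstones are now merged into a single cell before digesting
        System.out.println(NAME_ONLY.compare(a, b));
    }
}
{code}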


||2.1||2.2||
|[branch|https://github.com/spodkowinski/cassandra/tree/CASSANDRA-11349-2.1]|[branch|https://github.com/spodkowinski/cassandra/tree/CASSANDRA-11349-2.2]|
|[testall|http://cassci.datastax.com/view/Dev/view/spodkowinski/job/spodkowinski-CASSANDRA-11349-2.1-testall/]|[testall|http://cassci.datastax.com/view/Dev/view/spodkowinski/job/spodkowinski-CASSANDRA-11349-2.2-testall/]|
|[dtest|http://cassci.datastax.com/view/Dev/view/spodkowinski/job/spodkowinski-CASSANDRA-11349-2.1-dtest/]|[dtest|http://cassci.datastax.com/view/Dev/view/spodkowinski/job/spodkowinski-CASSANDRA-11349-2.2-dtest/]|


The only places other than validation compaction where {{LazilyCompactedRow}} is 
used are the cleanup and scrub functions, which shouldn't be affected, since 
those work on individual sstables and I assume there's no case where a single 
sstable can contain multiple identical range tombstones with different 
timestamps.


> MerkleTree mismatch when multiple range tombstones exists for the same 
> partition and interval
> ---------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-11349
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-11349
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Fabien Rousseau
>            Assignee: Stefan Podkowinski
>
> We observed that repair, for some of our clusters, streamed a lot of data and 
> many partitions were "out of sync".
> Moreover, the read repair mismatch ratio is around 3% on those clusters, 
> which is really high.
> After investigation, it appears that, if two range tombstones exist for a 
> partition for the same range/interval, they're both included in the merkle 
> tree computation.
> But, if for some reason, on another node, the two range tombstones were 
> already compacted into a single range tombstone, this will result in a merkle 
> tree difference.
> Currently, this is clearly bad because MerkleTree differences are dependent 
> on compactions (and if a partition is deleted and created multiple times, the 
> only way to ensure that repair "works correctly"/"doesn't overstream data" is 
> to major compact before each repair... which is not really feasible).
> Below is a list of steps to easily reproduce this case:
> {noformat}
> ccm create test -v 2.1.13 -n 2 -s
> ccm node1 cqlsh
> CREATE KEYSPACE test_rt WITH replication = {'class': 'SimpleStrategy', 
> 'replication_factor': 2};
> USE test_rt;
> CREATE TABLE IF NOT EXISTS table1 (
>     c1 text,
>     c2 text,
>     c3 float,
>     c4 float,
>     PRIMARY KEY ((c1), c2)
> );
> INSERT INTO table1 (c1, c2, c3, c4) VALUES ( 'a', 'b', 1, 2);
> DELETE FROM table1 WHERE c1 = 'a' AND c2 = 'b';
> ctrl ^d
> # now flush only one of the two nodes
> ccm node1 flush 
> ccm node1 cqlsh
> USE test_rt;
> INSERT INTO table1 (c1, c2, c3, c4) VALUES ( 'a', 'b', 1, 3);
> DELETE FROM table1 WHERE c1 = 'a' AND c2 = 'b';
> ctrl ^d
> ccm node1 repair
> # now grep the log and observe that some inconsistencies were detected 
> between nodes (while it shouldn't have detected any)
> ccm node1 showlog | grep "out of sync"
> {noformat}
> Consequences of this are a costly repair, the accumulation of many small 
> SSTables (up to thousands over a rather short period of time when using 
> VNodes, until compaction has absorbed those small files), but also an 
> increased size on disk.



