[ 
https://issues.apache.org/jira/browse/CASSANDRA-15789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17102676#comment-17102676
 ] 

Sylvain Lebresne commented on CASSANDRA-15789:
----------------------------------------------

I had a quick look at those commits, and agree with the fix in `LegacyLayout`.

I have no strong objection to the two other parts, but wanted to make two
remarks:
- Regarding the elimination of duplicates on iterators coming from
`LegacyLayout`: the patch currently merges the duplicates rather silently. What
if we have another bug in `LegacyLayout` for which row duplication is only one
symptom, but that also loses data? Are we sure we won't regret not failing on
what would be an unknown bug?
- Regarding the duplicate check on all reads, I "think" this could have a
measurable impact on performance for some workloads. That isn't a reason not
to add it, but since it affects all reads and will go into "stable" versions,
do we want to run a few benchmarks to quantify it? Or have a way to disable
the check? (A rough sketch of such a toggleable check follows below.)
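
To make that concrete, here is a minimal, self-contained sketch of what a
toggleable duplicate-clustering check could look like. This is not the actual
patch: the `Row` record, the property name `cassandra.check_duplicate_rows`,
and the fail-fast behaviour are placeholders chosen for illustration.

{code}
import java.util.Iterator;
import java.util.List;
import java.util.Objects;

// Hypothetical, simplified stand-in for the real row/clustering types.
record Row(List<Integer> clustering) {}

final class DuplicateRowChecker
{
    // Hypothetical flag; a real patch would pick its own property name.
    private static final boolean ENABLED =
        Boolean.parseBoolean(System.getProperty("cassandra.check_duplicate_rows", "true"));

    // Walks the rows of one partition (assumed to be in clustering order) and
    // fails loudly if two consecutive rows share the same clustering.
    static void validate(Iterator<Row> partitionRows)
    {
        if (!ENABLED)
            return;

        Row previous = null;
        while (partitionRows.hasNext())
        {
            Row current = partitionRows.next();
            if (previous != null && Objects.equals(previous.clustering(), current.clustering()))
                throw new IllegalStateException("Duplicate clustering " + current.clustering()
                                                + " detected; possible LegacyLayout corruption");
            previous = current;
        }
    }

    public static void main(String[] args)
    {
        // Two rows with the same clustering: this deliberately throws.
        validate(List.of(new Row(List.of(1)), new Row(List.of(1))).iterator());
    }
}
{code}

This is the "fail instead of merging silently" variant; swapping the throw for
a merge, or skipping the check when the flag is off, is what trades safety
against the per-read cost mentioned above.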


> Rows can get duplicated in mixed major-version clusters and after full upgrade
> ------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-15789
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-15789
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Consistency/Coordination, Local/Memtable, Local/SSTable
>            Reporter: Aleksey Yeschenko
>            Assignee: Marcus Eriksson
>            Priority: Normal
>
> In a mixed 2.X/3.X major version cluster a sequence of row deletes, 
> collection overwrites, paging, and read repair can cause 3.X nodes to split 
> individual rows into several rows with identical clustering. This happens due 
> to 2.X paging and RT semantics, and a 3.X {{LegacyLayout}} deficiency.
> To reproduce, set up a 2-node mixed major version cluster with the following 
> table:
> {code}
> CREATE TABLE distributed_test_keyspace.tbl (
>     pk int,
>     ck int,
>     v map<text, text>,
>     PRIMARY KEY (pk, ck)
> );
> {code}
> 1. Using either node as the coordinator, delete the row with ck=2 using
> timestamp 1:
> {code}
> DELETE FROM tbl USING TIMESTAMP 1 WHERE pk = 1 AND ck = 2;
> {code}
> 2. Using either node as the coordinator, insert the following 3 rows:
> {code}
> INSERT INTO tbl (pk, ck, v) VALUES (1, 1, {'e':'f'}) USING TIMESTAMP 3;
> INSERT INTO tbl (pk, ck, v) VALUES (1, 2, {'g':'h'}) USING TIMESTAMP 3;
> INSERT INTO tbl (pk, ck, v) VALUES (1, 3, {'i':'j'}) USING TIMESTAMP 3;
> {code}
> 3. Flush the table on both nodes
> 4. Using the 2.2 node as the coordinator, force read repair by querying the
> table with page size = 2:
> {code}
> SELECT * FROM tbl;
> {code}
> 5. Overwrite the row with ck=2 using timestamp 5:
> {code}
> INSERT INTO tbl (pk, ck, v) VALUES (1, 2, {'k':'l'}) USING TIMESTAMP 5;
> {code}
> 6. Query the 3.0 node and observe the split row:
> {code}
> cqlsh> select * from distributed_test_keyspace.tbl;
>  pk | ck | v
> ----+----+------------
>   1 |  1 | {'e': 'f'}
>   1 |  2 | {'g': 'h'}
>   1 |  2 | {'k': 'l'}
>   1 |  3 | {'i': 'j'}
> {code}
> This happens because the read to query the second page ends up generating the 
> following mutation for the 3.0 node:
> {code}
> ColumnFamily(tbl -{deletedAt=-9223372036854775808, localDeletion=2147483647,
>              ranges=[2:v:_-2:v:!, deletedAt=2, localDeletion=1588588821]
>                     [2:v:!-2:!,   deletedAt=1, localDeletion=1588588821]
>                     [3:v:_-3:v:!, deletedAt=2, localDeletion=1588588821]}-
>              [2:v:63:false:1@3,])
> {code}
> Which on 3.0 side gets incorrectly deserialized as
> {code}
> Mutation(keyspace='distributed_test_keyspace', key='00000001', modifications=[
>   [distributed_test_keyspace.tbl] key=1 partition_deletion=deletedAt=-9223372036854775808, localDeletion=2147483647 columns=[[] | [v]]
>     Row[info=[ts=-9223372036854775808] ]: ck=2 | del(v)=deletedAt=2, localDeletion=1588588821, [v[c]=d ts=3]
>     Row[info=[ts=-9223372036854775808] del=deletedAt=1, localDeletion=1588588821 ]: ck=2 |
>     Row[info=[ts=-9223372036854775808] ]: ck=3 | del(v)=deletedAt=2, localDeletion=1588588821
> ])
> {code}
> {{LegacyLayout}} correctly interprets a range tombstone whose start and
> finish {{collectionName}} values don't match as the wrapping fragment of a
> legacy row deletion that's being interrupted by a collection deletion - see
> [code|https://github.com/apache/cassandra/blob/cassandra-3.0/src/java/org/apache/cassandra/db/LegacyLayout.java#L1874-L1889].
> Quoting the comment inline:
> {code}
> // Because of the way RangeTombstoneList work, we can have a tombstone where only one of
> // the bound has a collectionName. That happens if we have a big tombstone A (spanning one
> // or multiple rows) and a collection tombstone B. In that case, RangeTombstoneList will
> // split this into 3 RTs: the first one from the beginning of A to the beginning of B,
> // then B, then a third one from the end of B to the end of A. To make this simpler, if
> // we detect that case we transform the 1st and 3rd tombstone so they don't end in the middle
> // of a row (which is still correct).
> {code}
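> As a purely illustrative sketch of the split this comment describes (the types
> below are simplified stand-ins, not the actual {{RangeTombstoneList}} code, and
> the integer bounds stand in for real clustering/collection bounds): a covering
> deletion A with a nested collection deletion B gets broken into three pieces.
> {code}
> import java.util.List;
> 
> // Simplified stand-in; real bounds carry clusterings, collection names and
> // inclusivity flags rather than plain ints.
> record Tombstone(int start, int end, long timestamp) {}
> 
> final class SplitExample
> {
>     // Split a covering tombstone 'a' around a nested collection tombstone 'b',
>     // mirroring the three RTs described in the comment above.
>     static List<Tombstone> split(Tombstone a, Tombstone b)
>     {
>         return List.of(new Tombstone(a.start(), b.start(), a.timestamp()), // start of A -> start of B
>                        b,                                                  // B itself
>                        new Tombstone(b.end(), a.end(), a.timestamp()));    // end of B -> end of A
>     }
> 
>     public static void main(String[] args)
>     {
>         // A: a row deletion covering the whole row; B: a collection deletion inside it.
>         split(new Tombstone(0, 10, 1), new Tombstone(3, 5, 2)).forEach(System.out::println);
>     }
> }
> {code}
> The first and third pieces are the ones that end up with a {{collectionName}}
> on only one bound, which is the case the transformation quoted above handles.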
> The {{LegacyLayout#addRowTombstone()}} method then chokes when it encounters
> such a tombstone in the middle of an existing row - having already seen
> {{v[c]=d}} - and mistakenly starts a new row while still in the middle of the
> existing one (see
> [code|https://github.com/apache/cassandra/blob/cassandra-3.0/src/java/org/apache/cassandra/db/LegacyLayout.java#L1500-L1501]).
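> For illustration only, here is a rough sketch of the kind of guard the
> deserialization path needs; the {{RowAssembler}} class and its methods are
> hypothetical simplifications, not the actual {{LegacyLayout}} code. The idea
> is simply that a row-level tombstone arriving mid-row should be applied to the
> row currently being assembled rather than opening a new row with the same
> clustering.
> {code}
> import java.util.Objects;
> 
> // Hypothetical, simplified builder for the row currently being deserialized.
> final class RowAssembler
> {
>     private Object currentClustering; // clustering of the row in progress, or null
> 
>     // Handle a row deletion covering 'clustering'. The bug amounted to starting
>     // a fresh row here even when 'clustering' matches the row in progress.
>     void addRowTombstone(Object clustering, long deletionTime)
>     {
>         if (currentClustering != null && Objects.equals(currentClustering, clustering))
>         {
>             // Same row: record the deletion on the row already being built.
>             applyDeletionToCurrentRow(deletionTime);
>             return;
>         }
> 
>         // Genuinely different row: finish the current one, then start anew.
>         finishCurrentRow();
>         currentClustering = clustering;
>         applyDeletionToCurrentRow(deletionTime);
>     }
> 
>     private void applyDeletionToCurrentRow(long deletionTime) { /* elided */ }
> 
>     private void finishCurrentRow() { currentClustering = null; /* elided */ }
> 
>     public static void main(String[] args)
>     {
>         RowAssembler assembler = new RowAssembler();
>         assembler.addRowTombstone(2, 1L); // ck=2, row deletion
>         assembler.addRowTombstone(2, 1L); // same clustering again: no new row is started
>     }
> }
> {code}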



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
