[ https://issues.apache.org/jira/browse/CASSANDRA-15789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17100653#comment-17100653 ]

Marcus Eriksson commented on CASSANDRA-15789:
---------------------------------------------

Attaching branches which contain commits to:
* Add an in-jvm dtest to reproduce the issue.
* Add a new {{PartitionUpdate}} method, {{fromPre30Iterator}}, which merges any subsequent duplicate rows (a sketch of the merge follows this list).
* Make sure we don't create a new row in legacy layout when encountering a row tombstone after seeing actual data.
* Add read- and compaction-time detection of the duplication, with the ability to automatically snapshot the involved replicas so the affected sstables can be investigated.
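
To illustrate the merge in the second bullet, here is a minimal standalone sketch (not the actual {{PartitionUpdate#fromPre30Iterator}} code; rows are reduced to a clustering value plus a cell map, and reconciliation to "later cells win"):

{code}
import java.util.*;

// Sketch: rows arrive in clustering order, but a pre-3.0 source can emit several
// "rows" with the same clustering. Merge adjacent duplicates so the resulting
// partition contains at most one row per clustering.
public class MergeDuplicateRows
{
    static class Row
    {
        final int clustering;
        final Map<String, String> cells = new LinkedHashMap<>();
        Row(int clustering) { this.clustering = clustering; }
        public String toString() { return "ck=" + clustering + " " + cells; }
    }

    static List<Row> mergeAdjacentDuplicates(Iterator<Row> input)
    {
        List<Row> merged = new ArrayList<>();
        Row current = null;
        while (input.hasNext())
        {
            Row next = input.next();
            if (current != null && current.clustering == next.clustering)
            {
                // Same clustering as the row we are building: fold the cells in
                // instead of emitting a second row. Real reconciliation compares
                // timestamps and deletions; here later cells simply win.
                current.cells.putAll(next.cells);
            }
            else
            {
                if (current != null)
                    merged.add(current);
                current = next;
            }
        }
        if (current != null)
            merged.add(current);
        return merged;
    }

    public static void main(String[] args)
    {
        Row a = new Row(2); a.cells.put("g", "h");
        Row b = new Row(2); b.cells.put("k", "l"); // duplicate clustering
        Row c = new Row(3); c.cells.put("i", "j");
        // Prints [ck=2 {g=h, k=l}, ck=3 {i=j}]
        System.out.println(mergeAdjacentDuplicates(Arrays.asList(a, b, c).iterator()));
    }
}
{code}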

The 3.0 and 3.11 branches contain the fixes, while trunk only has the duplication detection.

|| branch || unit tests || dtests || jvm dtests || jvm upgrade dtests ||
| [3.0|https://github.com/krummas/cassandra/commits/15789-3.0] | [utests|https://circleci.com/gh/krummas/cassandra/3262] | [vnodes|https://circleci.com/gh/krummas/cassandra/3264] [novnodes|https://circleci.com/gh/krummas/cassandra/3265] | [jvm dtests|https://circleci.com/gh/krummas/cassandra/3263] | [upgrade dtests|https://circleci.com/gh/krummas/cassandra/3267] |
| [3.11|https://github.com/krummas/cassandra/commits/15789-3.11] | [utests|https://circleci.com/gh/krummas/cassandra/3236] | [vnodes|https://circleci.com/gh/krummas/cassandra/3245] [novnodes|https://circleci.com/gh/krummas/cassandra/3244] | [jvm dtests|https://circleci.com/gh/krummas/cassandra/3235] | [upgrade dtests|https://circleci.com/gh/krummas/cassandra/3249] |
| [trunk|https://github.com/krummas/cassandra/commits/15789-trunk] | [utests|https://circleci.com/gh/krummas/cassandra/3238] | [vnodes|https://circleci.com/gh/krummas/cassandra/3247] [novnodes|https://circleci.com/gh/krummas/cassandra/3248] | [jvm dtests|https://circleci.com/gh/krummas/cassandra/3237] | [upgrade dtests|https://circleci.com/gh/krummas/cassandra/3252] |

The 3.11 upgrade dtest failure is expected since it uses the current 3.0 dtest jar.

> Rows can get duplicated in mixed major-version clusters and after full upgrade
> ------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-15789
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-15789
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Consistency/Coordination, Local/Memtable, Local/SSTable
>            Reporter: Aleksey Yeschenko
>            Assignee: Marcus Eriksson
>            Priority: Normal
>
> In a mixed 2.X/3.X major version cluster a sequence of row deletes, 
> collection overwrites, paging, and read repair can cause 3.X nodes to split 
> individual rows into several rows with identical clustering. This happens due 
> to 2.X paging and RT semantics, and a 3.X {{LegacyLayout}} deficiency.
> To reproduce, set up a 2-node mixed major version cluster with the following 
> table:
> {code}
> CREATE TABLE distributed_test_keyspace.tbl (
>     pk int,
>     ck int,
>     v map<text, text>,
>     PRIMARY KEY (pk, ck)
> );
> {code}
> 1. Using either node as the coordinator, delete the row with ck=2 using timestamp 1:
> {code}
> DELETE FROM tbl USING TIMESTAMP 1 WHERE pk = 1 AND ck = 2;
> {code}
> 2. Using either node as the coordinator, insert the following 3 rows:
> {code}
> INSERT INTO tbl (pk, ck, v) VALUES (1, 1, {'e':'f'}) USING TIMESTAMP 3;
> INSERT INTO tbl (pk, ck, v) VALUES (1, 2, {'g':'h'}) USING TIMESTAMP 3;
> INSERT INTO tbl (pk, ck, v) VALUES (1, 3, {'i':'j'}) USING TIMESTAMP 3;
> {code}
> 3. Flush the table on both nodes
> 4. Using the 2.2 node as the coordinator, force read repair by querying the table with page size = 2:
> {code}
> SELECT * FROM tbl;
> {code}
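> For reference, the paged read can be issued from the DataStax Java driver 3.x as below (a sketch; the driver, contact point and consistency level are assumptions, not part of the original repro - the important part is the fetch size of 2):
> {code}
> import com.datastax.driver.core.Cluster;
> import com.datastax.driver.core.ConsistencyLevel;
> import com.datastax.driver.core.ResultSet;
> import com.datastax.driver.core.Row;
> import com.datastax.driver.core.Session;
> import com.datastax.driver.core.SimpleStatement;
> import com.datastax.driver.core.Statement;
>
> public class PagedRead
> {
>     public static void main(String[] args)
>     {
>         // Contact point is the 2.2 node so it acts as coordinator (address is an assumption).
>         try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
>              Session session = cluster.connect("distributed_test_keyspace"))
>         {
>             // Fetch size 2 makes the coordinator page through the partition;
>             // CL ALL makes both replicas take part so a digest mismatch gets repaired.
>             Statement select = new SimpleStatement("SELECT * FROM tbl")
>                                .setFetchSize(2)
>                                .setConsistencyLevel(ConsistencyLevel.ALL);
>             ResultSet rows = session.execute(select);
>             for (Row row : rows)
>                 System.out.println(row);
>         }
>     }
> }
> {code}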
> 5. Overwrite the row with ck=2 using timestamp 5:
> {code}
> INSERT INTO tbl (pk, ck, v) VALUES (1, 2, {'k':'l'}) USING TIMESTAMP 5;
> {code}
> 6. Query the 3.0 node and observe the split row:
> {code}
> cqlsh> select * from distributed_test_keyspace.tbl ;
>  pk | ck | v
> ----+----+------------
>   1 |  1 | {'e': 'f'}
>   1 |  2 | {'g': 'h'}
>   1 |  2 | {'k': 'l'}
>   1 |  3 | {'i': 'j'}
> {code}
> This happens because the read to query the second page ends up generating the 
> following mutation for the 3.0 node:
> {code}
> ColumnFamily(tbl -{deletedAt=-9223372036854775808, localDeletion=2147483647,
>              ranges=[2:v:_-2:v:!, deletedAt=2, localDeletion=1588588821]
>                     [2:v:!-2:!,   deletedAt=1, localDeletion=1588588821]
>                     [3:v:_-3:v:!, deletedAt=2, localDeletion=1588588821]}-
>              [2:v:63:false:1@3,])
> {code}
> On the 3.0 side this gets incorrectly deserialized as:
> {code}
> Mutation(keyspace='distributed_test_keyspace', key='00000001', modifications=[
>   [distributed_test_keyspace.tbl] key=1 
> partition_deletion=deletedAt=-9223372036854775808, localDeletion=2147483647 
> columns=[[] | [v]]
>     Row[info=[ts=-9223372036854775808] ]: ck=2 | del(v)=deletedAt=2, 
> localDeletion=1588588821, [v[c]=d ts=3]
>     Row[info=[ts=-9223372036854775808] del=deletedAt=1, 
> localDeletion=1588588821 ]: ck=2 |
>     Row[info=[ts=-9223372036854775808] ]: ck=3 | del(v)=deletedAt=2, 
> localDeletion=1588588821
> ])
> {code}
> {{LegacyLayout}} correctly interprets a range tombstone whose start and finish
> {{collectionName}} values don't match as a wrapping fragment of a legacy row
> deletion that's being interrupted by a collection deletion - see
> [code|https://github.com/apache/cassandra/blob/cassandra-3.0/src/java/org/apache/cassandra/db/LegacyLayout.java#L1874-L1889].
> Quoting the comment inline:
> {code}
> // Because of the way RangeTombstoneList work, we can have a tombstone where only one of
> // the bound has a collectionName. That happens if we have a big tombstone A (spanning one
> // or multiple rows) and a collection tombstone B. In that case, RangeTombstoneList will
> // split this into 3 RTs: the first one from the beginning of A to the beginning of B,
> // then B, then a third one from the end of B to the end of A. To make this simpler, if
> // we detect that case we transform the 1st and 3rd tombstone so they don't end in the middle
> // of a row (which is still correct).
> {code}
> The {{LegacyLayout#addRowTombstone()}} method then chokes when it encounters
> such a tombstone after it has already seen data for the row ({{v[c]=d}} here),
> and mistakenly starts a new row with the same clustering instead of continuing
> the one it is building - see
> [code|https://github.com/apache/cassandra/blob/cassandra-3.0/src/java/org/apache/cassandra/db/LegacyLayout.java#L1500-L1501].
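> A conceptual sketch of that failure mode and of the fix (plain Java, not the actual {{LegacyLayout}} code; legacy atoms and rows are heavily simplified):
> {code}
> import java.util.*;
>
> // Legacy atoms for a partition arrive as one flat, clustering-ordered stream of
> // cells and row-tombstone fragments. A consumer that always starts a fresh row
> // when it meets a row tombstone emits two rows with the same clustering whenever
> // the tombstone fragment arrives after that row's cells - the duplication seen
> // above. Folding the tombstone into the row being built avoids it.
> public class LegacyAtomSketch
> {
>     static class Atom
>     {
>         final int clustering;
>         final String cellName;  // null marks a row tombstone fragment
>         final String cellValue;
>         Atom(int clustering, String cellName, String cellValue)
>         {
>             this.clustering = clustering;
>             this.cellName = cellName;
>             this.cellValue = cellValue;
>         }
>         boolean isRowTombstone() { return cellName == null; }
>     }
>
>     static class Row
>     {
>         final int clustering;
>         final Map<String, String> cells = new LinkedHashMap<>();
>         boolean deleted;
>         Row(int clustering) { this.clustering = clustering; }
>         public String toString() { return "ck=" + clustering + (deleted ? " [del]" : "") + " " + cells; }
>     }
>
>     static List<Row> consume(List<Atom> atoms, boolean buggy)
>     {
>         List<Row> rows = new ArrayList<>();
>         Row open = null;
>         for (Atom atom : atoms)
>         {
>             if (buggy && atom.isRowTombstone())
>             {
>                 // Buggy behaviour: a row tombstone unconditionally opens a new row,
>                 // even though a row with the same clustering is already being built.
>                 open = new Row(atom.clustering);
>                 open.deleted = true;
>                 rows.add(open);
>                 continue;
>             }
>             if (open == null || open.clustering != atom.clustering)
>             {
>                 open = new Row(atom.clustering);
>                 rows.add(open);
>             }
>             if (atom.isRowTombstone())
>                 open.deleted = true;  // fixed behaviour: fold the deletion into the open row
>             else
>                 open.cells.put(atom.cellName, atom.cellValue);
>         }
>         return rows;
>     }
>
>     public static void main(String[] args)
>     {
>         // A ck=2 data cell followed by a row tombstone fragment for the same clustering,
>         // mirroring the mutation shown above.
>         List<Atom> atoms = Arrays.asList(new Atom(2, "g", "h"),
>                                          new Atom(2, null, null),
>                                          new Atom(3, "i", "j"));
>         System.out.println("buggy: " + consume(atoms, true));   // two ck=2 rows
>         System.out.println("fixed: " + consume(atoms, false));  // one ck=2 row
>     }
> }
> {code}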


