[jira] [Updated] (CASSANDRA-15789) Rows can get duplicated in mixed major-version clusters and after full upgrade

2020-10-21 Thread Michael Semb Wever (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-15789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Semb Wever updated CASSANDRA-15789:
---
Fix Version/s: (was: 4.0-alpha)
   4.0-beta1
   4.0

> Rows can get duplicated in mixed major-version clusters and after full upgrade
> --
>
> Key: CASSANDRA-15789
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15789
> Project: Cassandra
>  Issue Type: Bug
>  Components: Consistency/Coordination, Local/Memtable, Local/SSTable
>Reporter: Aleksey Yeschenko
>Assignee: Marcus Eriksson
>Priority: Normal
> Fix For: 3.0.21, 3.11.7, 4.0, 4.0-beta1
>
>
> In a mixed 2.X/3.X major version cluster a sequence of row deletes, 
> collection overwrites, paging, and read repair can cause 3.X nodes to split 
> individual rows into several rows with identical clustering. This happens due 
> to 2.X paging and RT semantics, and a 3.X {{LegacyLayout}} deficiency.
> To reproduce, set up a 2-node mixed major version cluster with the following 
> table:
> {code}
> CREATE TABLE distributed_test_keyspace.tlb (
> pk int,
> ck int,
> v map,
> PRIMARY KEY (pk, ck)
> );
> {code}
> 1. Using either node as the coordinator, delete the row with ck=2 using 
> timestamp 1
> {code}
> DELETE FROM tbl USING TIMESTAMP 1 WHERE pk = 1 AND ck = 2;
> {code}
> 2. Using either node as the coordinator, insert the following 3 rows:
> {code}
> INSERT INTO tbl (pk, ck, v) VALUES (1, 1, {'e':'f'}) USING TIMESTAMP 3;
> INSERT INTO tbl (pk, ck, v) VALUES (1, 2, {'g':'h'}) USING TIMESTAMP 3;
> INSERT INTO tbl (pk, ck, v) VALUES (1, 3, {'i':'j'}) USING TIMESTAMP 3;
> {code}
> 3. Flush the table on both nodes
> 4. Using the 2.2 node as the coordinator, force read repar by querying the 
> table with page size = 2:
>  
> {code}
> SELECT * FROM tbl;
> {code}
> 5. Overwrite the row with ck=2 using timestamp 5:
> {code}
> INSERT INTO tbl (pk, ck, v) VALUES (1, 2, {'g':'h'}) USING TIMESTAMP 5;}}
> {code}
> 6. Query the 3.0 node and observe the split row:
> {code}
> cqlsh> select * from distributed_test_keyspace.tlb ;
>  pk | ck | v
> ++
>   1 |  1 | {'e': 'f'}
>   1 |  2 | {'g': 'h'}
>   1 |  2 | {'k': 'l'}
>   1 |  3 | {'i': 'j'}
> {code}
> This happens because the read to query the second page ends up generating the 
> following mutation for the 3.0 node:
> {code}
> ColumnFamily(tbl -{deletedAt=-9223372036854775808, localDeletion=2147483647,
>  ranges=[2:v:_-2:v:!, deletedAt=2, localDeletion=1588588821]
> [2:v:!-2:!,   deletedAt=1, localDeletion=1588588821]
> [3:v:_-3:v:!, deletedAt=2, localDeletion=1588588821]}-
>  [2:v:63:false:1@3,])
> {code}
> Which on 3.0 side gets incorrectly deserialized as
> {code}
> Mutation(keyspace='distributed_test_keyspace', key='0001', modifications=[
>   [distributed_test_keyspace.tbl] key=1 
> partition_deletion=deletedAt=-9223372036854775808, localDeletion=2147483647 
> columns=[[] | [v]]
> Row[info=[ts=-9223372036854775808] ]: ck=2 | del(v)=deletedAt=2, 
> localDeletion=1588588821, [v[c]=d ts=3]
> Row[info=[ts=-9223372036854775808] del=deletedAt=1, 
> localDeletion=1588588821 ]: ck=2 |
> Row[info=[ts=-9223372036854775808] ]: ck=3 | del(v)=deletedAt=2, 
> localDeletion=1588588821
> ])
> {code}
> {{LegacyLayout}} correctly interprets a range tombstone whose start and 
> finish {{collectionName}} values don't match as a wrapping fragment of a 
> legacy row deletion that's being interrupted by a collection deletion 
> (correctly) - see 
> [code|https://github.com/apache/cassandra/blob/cassandra-3.0/src/java/org/apache/cassandra/db/LegacyLayout.java#L1874-L1889].
>  Quoting the comment inline:
> {code}
> // Because of the way RangeTombstoneList work, we can have a tombstone where 
> only one of
> // the bound has a collectionName. That happens if we have a big tombstone A 
> (spanning one
> // or multiple rows) and a collection tombstone B. In that case, 
> RangeTombstoneList will
> // split this into 3 RTs: the first one from the beginning of A to the 
> beginning of B,
> // then B, then a third one from the end of B to the end of A. To make this 
> simpler, if
>  // we detect that case we transform the 1st and 3rd tombstone so they don't 
> end in the middle
>  // of a row (which is still correct).
> {code}
> {{LegacyLayout#addRowTombstone()}} method then chokes when it encounters such 
> a tombstone in the middle of an existing row - having seen {{v[c]=d}} first, 
> and mistakenly starts a new row, while in the middle of an existing one: (see 
> [code|https://github.com/apache/cassandra/blob/cassandra-3.0/src/java/org/apache/cassandra/db/LegacyLayout.java#L1500-L1501]).




[jira] [Updated] (CASSANDRA-15789) Rows can get duplicated in mixed major-version clusters and after full upgrade

2020-05-20 Thread Marcus Eriksson (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-15789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcus Eriksson updated CASSANDRA-15789:

  Fix Version/s: 4.0-alpha
 3.11.7
 3.0.21
  Since Version: 3.0 alpha 1
Source Control Link: 
https://github.com/apache/cassandra/commit/4d42c189fa82b32fd93ae42a164b91e4db62992e
 Resolution: Fixed
 Status: Resolved  (was: Ready to Commit)

and committed, thanks!

> Rows can get duplicated in mixed major-version clusters and after full upgrade
> --
>
> Key: CASSANDRA-15789
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15789
> Project: Cassandra
>  Issue Type: Bug
>  Components: Consistency/Coordination, Local/Memtable, Local/SSTable
>Reporter: Aleksey Yeschenko
>Assignee: Marcus Eriksson
>Priority: Normal
> Fix For: 3.0.21, 3.11.7, 4.0-alpha
>
>
> In a mixed 2.X/3.X major version cluster a sequence of row deletes, 
> collection overwrites, paging, and read repair can cause 3.X nodes to split 
> individual rows into several rows with identical clustering. This happens due 
> to 2.X paging and RT semantics, and a 3.X {{LegacyLayout}} deficiency.
> To reproduce, set up a 2-node mixed major version cluster with the following 
> table:
> {code}
> CREATE TABLE distributed_test_keyspace.tlb (
> pk int,
> ck int,
> v map,
> PRIMARY KEY (pk, ck)
> );
> {code}
> 1. Using either node as the coordinator, delete the row with ck=2 using 
> timestamp 1
> {code}
> DELETE FROM tbl USING TIMESTAMP 1 WHERE pk = 1 AND ck = 2;
> {code}
> 2. Using either node as the coordinator, insert the following 3 rows:
> {code}
> INSERT INTO tbl (pk, ck, v) VALUES (1, 1, {'e':'f'}) USING TIMESTAMP 3;
> INSERT INTO tbl (pk, ck, v) VALUES (1, 2, {'g':'h'}) USING TIMESTAMP 3;
> INSERT INTO tbl (pk, ck, v) VALUES (1, 3, {'i':'j'}) USING TIMESTAMP 3;
> {code}
> 3. Flush the table on both nodes
> 4. Using the 2.2 node as the coordinator, force read repar by querying the 
> table with page size = 2:
>  
> {code}
> SELECT * FROM tbl;
> {code}
> 5. Overwrite the row with ck=2 using timestamp 5:
> {code}
> INSERT INTO tbl (pk, ck, v) VALUES (1, 2, {'g':'h'}) USING TIMESTAMP 5;}}
> {code}
> 6. Query the 3.0 node and observe the split row:
> {code}
> cqlsh> select * from distributed_test_keyspace.tlb ;
>  pk | ck | v
> ++
>   1 |  1 | {'e': 'f'}
>   1 |  2 | {'g': 'h'}
>   1 |  2 | {'k': 'l'}
>   1 |  3 | {'i': 'j'}
> {code}
> This happens because the read to query the second page ends up generating the 
> following mutation for the 3.0 node:
> {code}
> ColumnFamily(tbl -{deletedAt=-9223372036854775808, localDeletion=2147483647,
>  ranges=[2:v:_-2:v:!, deletedAt=2, localDeletion=1588588821]
> [2:v:!-2:!,   deletedAt=1, localDeletion=1588588821]
> [3:v:_-3:v:!, deletedAt=2, localDeletion=1588588821]}-
>  [2:v:63:false:1@3,])
> {code}
> Which on 3.0 side gets incorrectly deserialized as
> {code}
> Mutation(keyspace='distributed_test_keyspace', key='0001', modifications=[
>   [distributed_test_keyspace.tbl] key=1 
> partition_deletion=deletedAt=-9223372036854775808, localDeletion=2147483647 
> columns=[[] | [v]]
> Row[info=[ts=-9223372036854775808] ]: ck=2 | del(v)=deletedAt=2, 
> localDeletion=1588588821, [v[c]=d ts=3]
> Row[info=[ts=-9223372036854775808] del=deletedAt=1, 
> localDeletion=1588588821 ]: ck=2 |
> Row[info=[ts=-9223372036854775808] ]: ck=3 | del(v)=deletedAt=2, 
> localDeletion=1588588821
> ])
> {code}
> {{LegacyLayout}} correctly interprets a range tombstone whose start and 
> finish {{collectionName}} values don't match as a wrapping fragment of a 
> legacy row deletion that's being interrupted by a collection deletion 
> (correctly) - see 
> [code|https://github.com/apache/cassandra/blob/cassandra-3.0/src/java/org/apache/cassandra/db/LegacyLayout.java#L1874-L1889].
>  Quoting the comment inline:
> {code}
> // Because of the way RangeTombstoneList work, we can have a tombstone where 
> only one of
> // the bound has a collectionName. That happens if we have a big tombstone A 
> (spanning one
> // or multiple rows) and a collection tombstone B. In that case, 
> RangeTombstoneList will
> // split this into 3 RTs: the first one from the beginning of A to the 
> beginning of B,
> // then B, then a third one from the end of B to the end of A. To make this 
> simpler, if
>  // we detect that case we transform the 1st and 3rd tombstone so they don't 
> end in the middle
>  // of a row (which is still correct).
> {code}
> {{LegacyLayout#addRowTombstone()}} method then chokes when it encounters such 
> a tombstone in the middle of an 

[jira] [Updated] (CASSANDRA-15789) Rows can get duplicated in mixed major-version clusters and after full upgrade

2020-05-20 Thread Marcus Eriksson (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-15789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcus Eriksson updated CASSANDRA-15789:

Status: In Progress  (was: Patch Available)

> Rows can get duplicated in mixed major-version clusters and after full upgrade
> --
>
> Key: CASSANDRA-15789
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15789
> Project: Cassandra
>  Issue Type: Bug
>  Components: Consistency/Coordination, Local/Memtable, Local/SSTable
>Reporter: Aleksey Yeschenko
>Assignee: Marcus Eriksson
>Priority: Normal
>
> In a mixed 2.X/3.X major version cluster a sequence of row deletes, 
> collection overwrites, paging, and read repair can cause 3.X nodes to split 
> individual rows into several rows with identical clustering. This happens due 
> to 2.X paging and RT semantics, and a 3.X {{LegacyLayout}} deficiency.
> To reproduce, set up a 2-node mixed major version cluster with the following 
> table:
> {code}
> CREATE TABLE distributed_test_keyspace.tlb (
> pk int,
> ck int,
> v map,
> PRIMARY KEY (pk, ck)
> );
> {code}
> 1. Using either node as the coordinator, delete the row with ck=2 using 
> timestamp 1
> {code}
> DELETE FROM tbl USING TIMESTAMP 1 WHERE pk = 1 AND ck = 2;
> {code}
> 2. Using either node as the coordinator, insert the following 3 rows:
> {code}
> INSERT INTO tbl (pk, ck, v) VALUES (1, 1, {'e':'f'}) USING TIMESTAMP 3;
> INSERT INTO tbl (pk, ck, v) VALUES (1, 2, {'g':'h'}) USING TIMESTAMP 3;
> INSERT INTO tbl (pk, ck, v) VALUES (1, 3, {'i':'j'}) USING TIMESTAMP 3;
> {code}
> 3. Flush the table on both nodes
> 4. Using the 2.2 node as the coordinator, force read repar by querying the 
> table with page size = 2:
>  
> {code}
> SELECT * FROM tbl;
> {code}
> 5. Overwrite the row with ck=2 using timestamp 5:
> {code}
> INSERT INTO tbl (pk, ck, v) VALUES (1, 2, {'g':'h'}) USING TIMESTAMP 5;}}
> {code}
> 6. Query the 3.0 node and observe the split row:
> {code}
> cqlsh> select * from distributed_test_keyspace.tlb ;
>  pk | ck | v
> ++
>   1 |  1 | {'e': 'f'}
>   1 |  2 | {'g': 'h'}
>   1 |  2 | {'k': 'l'}
>   1 |  3 | {'i': 'j'}
> {code}
> This happens because the read to query the second page ends up generating the 
> following mutation for the 3.0 node:
> {code}
> ColumnFamily(tbl -{deletedAt=-9223372036854775808, localDeletion=2147483647,
>  ranges=[2:v:_-2:v:!, deletedAt=2, localDeletion=1588588821]
> [2:v:!-2:!,   deletedAt=1, localDeletion=1588588821]
> [3:v:_-3:v:!, deletedAt=2, localDeletion=1588588821]}-
>  [2:v:63:false:1@3,])
> {code}
> Which on 3.0 side gets incorrectly deserialized as
> {code}
> Mutation(keyspace='distributed_test_keyspace', key='0001', modifications=[
>   [distributed_test_keyspace.tbl] key=1 
> partition_deletion=deletedAt=-9223372036854775808, localDeletion=2147483647 
> columns=[[] | [v]]
> Row[info=[ts=-9223372036854775808] ]: ck=2 | del(v)=deletedAt=2, 
> localDeletion=1588588821, [v[c]=d ts=3]
> Row[info=[ts=-9223372036854775808] del=deletedAt=1, 
> localDeletion=1588588821 ]: ck=2 |
> Row[info=[ts=-9223372036854775808] ]: ck=3 | del(v)=deletedAt=2, 
> localDeletion=1588588821
> ])
> {code}
> {{LegacyLayout}} correctly interprets a range tombstone whose start and 
> finish {{collectionName}} values don't match as a wrapping fragment of a 
> legacy row deletion that's being interrupted by a collection deletion 
> (correctly) - see 
> [code|https://github.com/apache/cassandra/blob/cassandra-3.0/src/java/org/apache/cassandra/db/LegacyLayout.java#L1874-L1889].
>  Quoting the comment inline:
> {code}
> // Because of the way RangeTombstoneList work, we can have a tombstone where 
> only one of
> // the bound has a collectionName. That happens if we have a big tombstone A 
> (spanning one
> // or multiple rows) and a collection tombstone B. In that case, 
> RangeTombstoneList will
> // split this into 3 RTs: the first one from the beginning of A to the 
> beginning of B,
> // then B, then a third one from the end of B to the end of A. To make this 
> simpler, if
>  // we detect that case we transform the 1st and 3rd tombstone so they don't 
> end in the middle
>  // of a row (which is still correct).
> {code}
> {{LegacyLayout#addRowTombstone()}} method then chokes when it encounters such 
> a tombstone in the middle of an existing row - having seen {{v[c]=d}} first, 
> and mistakenly starts a new row, while in the middle of an existing one: (see 
> [code|https://github.com/apache/cassandra/blob/cassandra-3.0/src/java/org/apache/cassandra/db/LegacyLayout.java#L1500-L1501]).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (CASSANDRA-15789) Rows can get duplicated in mixed major-version clusters and after full upgrade

2020-05-20 Thread Marcus Eriksson (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-15789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcus Eriksson updated CASSANDRA-15789:

Status: Patch Available  (was: In Progress)

> Rows can get duplicated in mixed major-version clusters and after full upgrade
> --
>
> Key: CASSANDRA-15789
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15789
> Project: Cassandra
>  Issue Type: Bug
>  Components: Consistency/Coordination, Local/Memtable, Local/SSTable
>Reporter: Aleksey Yeschenko
>Assignee: Marcus Eriksson
>Priority: Normal
>
> In a mixed 2.X/3.X major version cluster a sequence of row deletes, 
> collection overwrites, paging, and read repair can cause 3.X nodes to split 
> individual rows into several rows with identical clustering. This happens due 
> to 2.X paging and RT semantics, and a 3.X {{LegacyLayout}} deficiency.
> To reproduce, set up a 2-node mixed major version cluster with the following 
> table:
> {code}
> CREATE TABLE distributed_test_keyspace.tlb (
> pk int,
> ck int,
> v map,
> PRIMARY KEY (pk, ck)
> );
> {code}
> 1. Using either node as the coordinator, delete the row with ck=2 using 
> timestamp 1
> {code}
> DELETE FROM tbl USING TIMESTAMP 1 WHERE pk = 1 AND ck = 2;
> {code}
> 2. Using either node as the coordinator, insert the following 3 rows:
> {code}
> INSERT INTO tbl (pk, ck, v) VALUES (1, 1, {'e':'f'}) USING TIMESTAMP 3;
> INSERT INTO tbl (pk, ck, v) VALUES (1, 2, {'g':'h'}) USING TIMESTAMP 3;
> INSERT INTO tbl (pk, ck, v) VALUES (1, 3, {'i':'j'}) USING TIMESTAMP 3;
> {code}
> 3. Flush the table on both nodes
> 4. Using the 2.2 node as the coordinator, force read repar by querying the 
> table with page size = 2:
>  
> {code}
> SELECT * FROM tbl;
> {code}
> 5. Overwrite the row with ck=2 using timestamp 5:
> {code}
> INSERT INTO tbl (pk, ck, v) VALUES (1, 2, {'g':'h'}) USING TIMESTAMP 5;}}
> {code}
> 6. Query the 3.0 node and observe the split row:
> {code}
> cqlsh> select * from distributed_test_keyspace.tlb ;
>  pk | ck | v
> ++
>   1 |  1 | {'e': 'f'}
>   1 |  2 | {'g': 'h'}
>   1 |  2 | {'k': 'l'}
>   1 |  3 | {'i': 'j'}
> {code}
> This happens because the read to query the second page ends up generating the 
> following mutation for the 3.0 node:
> {code}
> ColumnFamily(tbl -{deletedAt=-9223372036854775808, localDeletion=2147483647,
>  ranges=[2:v:_-2:v:!, deletedAt=2, localDeletion=1588588821]
> [2:v:!-2:!,   deletedAt=1, localDeletion=1588588821]
> [3:v:_-3:v:!, deletedAt=2, localDeletion=1588588821]}-
>  [2:v:63:false:1@3,])
> {code}
> Which on 3.0 side gets incorrectly deserialized as
> {code}
> Mutation(keyspace='distributed_test_keyspace', key='0001', modifications=[
>   [distributed_test_keyspace.tbl] key=1 
> partition_deletion=deletedAt=-9223372036854775808, localDeletion=2147483647 
> columns=[[] | [v]]
> Row[info=[ts=-9223372036854775808] ]: ck=2 | del(v)=deletedAt=2, 
> localDeletion=1588588821, [v[c]=d ts=3]
> Row[info=[ts=-9223372036854775808] del=deletedAt=1, 
> localDeletion=1588588821 ]: ck=2 |
> Row[info=[ts=-9223372036854775808] ]: ck=3 | del(v)=deletedAt=2, 
> localDeletion=1588588821
> ])
> {code}
> {{LegacyLayout}} correctly interprets a range tombstone whose start and 
> finish {{collectionName}} values don't match as a wrapping fragment of a 
> legacy row deletion that's being interrupted by a collection deletion 
> (correctly) - see 
> [code|https://github.com/apache/cassandra/blob/cassandra-3.0/src/java/org/apache/cassandra/db/LegacyLayout.java#L1874-L1889].
>  Quoting the comment inline:
> {code}
> // Because of the way RangeTombstoneList work, we can have a tombstone where 
> only one of
> // the bound has a collectionName. That happens if we have a big tombstone A 
> (spanning one
> // or multiple rows) and a collection tombstone B. In that case, 
> RangeTombstoneList will
> // split this into 3 RTs: the first one from the beginning of A to the 
> beginning of B,
> // then B, then a third one from the end of B to the end of A. To make this 
> simpler, if
>  // we detect that case we transform the 1st and 3rd tombstone so they don't 
> end in the middle
>  // of a row (which is still correct).
> {code}
> {{LegacyLayout#addRowTombstone()}} method then chokes when it encounters such 
> a tombstone in the middle of an existing row - having seen {{v[c]=d}} first, 
> and mistakenly starts a new row, while in the middle of an existing one: (see 
> [code|https://github.com/apache/cassandra/blob/cassandra-3.0/src/java/org/apache/cassandra/db/LegacyLayout.java#L1500-L1501]).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (CASSANDRA-15789) Rows can get duplicated in mixed major-version clusters and after full upgrade

2020-05-20 Thread Marcus Eriksson (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-15789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcus Eriksson updated CASSANDRA-15789:

Reviewers: Alex Petrov, Sylvain Lebresne, Marcus Eriksson  (was: Alex 
Petrov, Marcus Eriksson, Sylvain Lebresne)
   Alex Petrov, Sylvain Lebresne, Marcus Eriksson  (was: Alex 
Petrov, Sylvain Lebresne)
   Status: Review In Progress  (was: Patch Available)

> Rows can get duplicated in mixed major-version clusters and after full upgrade
> --
>
> Key: CASSANDRA-15789
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15789
> Project: Cassandra
>  Issue Type: Bug
>  Components: Consistency/Coordination, Local/Memtable, Local/SSTable
>Reporter: Aleksey Yeschenko
>Assignee: Marcus Eriksson
>Priority: Normal
>
> In a mixed 2.X/3.X major version cluster a sequence of row deletes, 
> collection overwrites, paging, and read repair can cause 3.X nodes to split 
> individual rows into several rows with identical clustering. This happens due 
> to 2.X paging and RT semantics, and a 3.X {{LegacyLayout}} deficiency.
> To reproduce, set up a 2-node mixed major version cluster with the following 
> table:
> {code}
> CREATE TABLE distributed_test_keyspace.tlb (
> pk int,
> ck int,
> v map,
> PRIMARY KEY (pk, ck)
> );
> {code}
> 1. Using either node as the coordinator, delete the row with ck=2 using 
> timestamp 1
> {code}
> DELETE FROM tbl USING TIMESTAMP 1 WHERE pk = 1 AND ck = 2;
> {code}
> 2. Using either node as the coordinator, insert the following 3 rows:
> {code}
> INSERT INTO tbl (pk, ck, v) VALUES (1, 1, {'e':'f'}) USING TIMESTAMP 3;
> INSERT INTO tbl (pk, ck, v) VALUES (1, 2, {'g':'h'}) USING TIMESTAMP 3;
> INSERT INTO tbl (pk, ck, v) VALUES (1, 3, {'i':'j'}) USING TIMESTAMP 3;
> {code}
> 3. Flush the table on both nodes
> 4. Using the 2.2 node as the coordinator, force read repar by querying the 
> table with page size = 2:
>  
> {code}
> SELECT * FROM tbl;
> {code}
> 5. Overwrite the row with ck=2 using timestamp 5:
> {code}
> INSERT INTO tbl (pk, ck, v) VALUES (1, 2, {'g':'h'}) USING TIMESTAMP 5;}}
> {code}
> 6. Query the 3.0 node and observe the split row:
> {code}
> cqlsh> select * from distributed_test_keyspace.tlb ;
>  pk | ck | v
> ++
>   1 |  1 | {'e': 'f'}
>   1 |  2 | {'g': 'h'}
>   1 |  2 | {'k': 'l'}
>   1 |  3 | {'i': 'j'}
> {code}
> This happens because the read to query the second page ends up generating the 
> following mutation for the 3.0 node:
> {code}
> ColumnFamily(tbl -{deletedAt=-9223372036854775808, localDeletion=2147483647,
>  ranges=[2:v:_-2:v:!, deletedAt=2, localDeletion=1588588821]
> [2:v:!-2:!,   deletedAt=1, localDeletion=1588588821]
> [3:v:_-3:v:!, deletedAt=2, localDeletion=1588588821]}-
>  [2:v:63:false:1@3,])
> {code}
> Which on 3.0 side gets incorrectly deserialized as
> {code}
> Mutation(keyspace='distributed_test_keyspace', key='0001', modifications=[
>   [distributed_test_keyspace.tbl] key=1 
> partition_deletion=deletedAt=-9223372036854775808, localDeletion=2147483647 
> columns=[[] | [v]]
> Row[info=[ts=-9223372036854775808] ]: ck=2 | del(v)=deletedAt=2, 
> localDeletion=1588588821, [v[c]=d ts=3]
> Row[info=[ts=-9223372036854775808] del=deletedAt=1, 
> localDeletion=1588588821 ]: ck=2 |
> Row[info=[ts=-9223372036854775808] ]: ck=3 | del(v)=deletedAt=2, 
> localDeletion=1588588821
> ])
> {code}
> {{LegacyLayout}} correctly interprets a range tombstone whose start and 
> finish {{collectionName}} values don't match as a wrapping fragment of a 
> legacy row deletion that's being interrupted by a collection deletion 
> (correctly) - see 
> [code|https://github.com/apache/cassandra/blob/cassandra-3.0/src/java/org/apache/cassandra/db/LegacyLayout.java#L1874-L1889].
>  Quoting the comment inline:
> {code}
> // Because of the way RangeTombstoneList work, we can have a tombstone where 
> only one of
> // the bound has a collectionName. That happens if we have a big tombstone A 
> (spanning one
> // or multiple rows) and a collection tombstone B. In that case, 
> RangeTombstoneList will
> // split this into 3 RTs: the first one from the beginning of A to the 
> beginning of B,
> // then B, then a third one from the end of B to the end of A. To make this 
> simpler, if
>  // we detect that case we transform the 1st and 3rd tombstone so they don't 
> end in the middle
>  // of a row (which is still correct).
> {code}
> {{LegacyLayout#addRowTombstone()}} method then chokes when it encounters such 
> a tombstone in the middle of an existing row - having seen {{v[c]=d}} first, 
> and mistakenly starts a new row, while in the middle of an existing one: (see 
> 

[jira] [Updated] (CASSANDRA-15789) Rows can get duplicated in mixed major-version clusters and after full upgrade

2020-05-20 Thread Marcus Eriksson (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-15789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcus Eriksson updated CASSANDRA-15789:

Status: Ready to Commit  (was: Review In Progress)

> Rows can get duplicated in mixed major-version clusters and after full upgrade
> --
>
> Key: CASSANDRA-15789
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15789
> Project: Cassandra
>  Issue Type: Bug
>  Components: Consistency/Coordination, Local/Memtable, Local/SSTable
>Reporter: Aleksey Yeschenko
>Assignee: Marcus Eriksson
>Priority: Normal
>
> In a mixed 2.X/3.X major version cluster a sequence of row deletes, 
> collection overwrites, paging, and read repair can cause 3.X nodes to split 
> individual rows into several rows with identical clustering. This happens due 
> to 2.X paging and RT semantics, and a 3.X {{LegacyLayout}} deficiency.
> To reproduce, set up a 2-node mixed major version cluster with the following 
> table:
> {code}
> CREATE TABLE distributed_test_keyspace.tlb (
> pk int,
> ck int,
> v map,
> PRIMARY KEY (pk, ck)
> );
> {code}
> 1. Using either node as the coordinator, delete the row with ck=2 using 
> timestamp 1
> {code}
> DELETE FROM tbl USING TIMESTAMP 1 WHERE pk = 1 AND ck = 2;
> {code}
> 2. Using either node as the coordinator, insert the following 3 rows:
> {code}
> INSERT INTO tbl (pk, ck, v) VALUES (1, 1, {'e':'f'}) USING TIMESTAMP 3;
> INSERT INTO tbl (pk, ck, v) VALUES (1, 2, {'g':'h'}) USING TIMESTAMP 3;
> INSERT INTO tbl (pk, ck, v) VALUES (1, 3, {'i':'j'}) USING TIMESTAMP 3;
> {code}
> 3. Flush the table on both nodes
> 4. Using the 2.2 node as the coordinator, force read repar by querying the 
> table with page size = 2:
>  
> {code}
> SELECT * FROM tbl;
> {code}
> 5. Overwrite the row with ck=2 using timestamp 5:
> {code}
> INSERT INTO tbl (pk, ck, v) VALUES (1, 2, {'g':'h'}) USING TIMESTAMP 5;}}
> {code}
> 6. Query the 3.0 node and observe the split row:
> {code}
> cqlsh> select * from distributed_test_keyspace.tlb ;
>  pk | ck | v
> ++
>   1 |  1 | {'e': 'f'}
>   1 |  2 | {'g': 'h'}
>   1 |  2 | {'k': 'l'}
>   1 |  3 | {'i': 'j'}
> {code}
> This happens because the read to query the second page ends up generating the 
> following mutation for the 3.0 node:
> {code}
> ColumnFamily(tbl -{deletedAt=-9223372036854775808, localDeletion=2147483647,
>  ranges=[2:v:_-2:v:!, deletedAt=2, localDeletion=1588588821]
> [2:v:!-2:!,   deletedAt=1, localDeletion=1588588821]
> [3:v:_-3:v:!, deletedAt=2, localDeletion=1588588821]}-
>  [2:v:63:false:1@3,])
> {code}
> Which on 3.0 side gets incorrectly deserialized as
> {code}
> Mutation(keyspace='distributed_test_keyspace', key='0001', modifications=[
>   [distributed_test_keyspace.tbl] key=1 
> partition_deletion=deletedAt=-9223372036854775808, localDeletion=2147483647 
> columns=[[] | [v]]
> Row[info=[ts=-9223372036854775808] ]: ck=2 | del(v)=deletedAt=2, 
> localDeletion=1588588821, [v[c]=d ts=3]
> Row[info=[ts=-9223372036854775808] del=deletedAt=1, 
> localDeletion=1588588821 ]: ck=2 |
> Row[info=[ts=-9223372036854775808] ]: ck=3 | del(v)=deletedAt=2, 
> localDeletion=1588588821
> ])
> {code}
> {{LegacyLayout}} correctly interprets a range tombstone whose start and 
> finish {{collectionName}} values don't match as a wrapping fragment of a 
> legacy row deletion that's being interrupted by a collection deletion 
> (correctly) - see 
> [code|https://github.com/apache/cassandra/blob/cassandra-3.0/src/java/org/apache/cassandra/db/LegacyLayout.java#L1874-L1889].
>  Quoting the comment inline:
> {code}
> // Because of the way RangeTombstoneList work, we can have a tombstone where 
> only one of
> // the bound has a collectionName. That happens if we have a big tombstone A 
> (spanning one
> // or multiple rows) and a collection tombstone B. In that case, 
> RangeTombstoneList will
> // split this into 3 RTs: the first one from the beginning of A to the 
> beginning of B,
> // then B, then a third one from the end of B to the end of A. To make this 
> simpler, if
>  // we detect that case we transform the 1st and 3rd tombstone so they don't 
> end in the middle
>  // of a row (which is still correct).
> {code}
> {{LegacyLayout#addRowTombstone()}} method then chokes when it encounters such 
> a tombstone in the middle of an existing row - having seen {{v[c]=d}} first, 
> and mistakenly starts a new row, while in the middle of an existing one: (see 
> [code|https://github.com/apache/cassandra/blob/cassandra-3.0/src/java/org/apache/cassandra/db/LegacyLayout.java#L1500-L1501]).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (CASSANDRA-15789) Rows can get duplicated in mixed major-version clusters and after full upgrade

2020-05-18 Thread Marcus Eriksson (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-15789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcus Eriksson updated CASSANDRA-15789:

Reviewers: Alex Petrov, Sylvain Lebresne  (was: Alex Petrov)

> Rows can get duplicated in mixed major-version clusters and after full upgrade
> --
>
> Key: CASSANDRA-15789
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15789
> Project: Cassandra
>  Issue Type: Bug
>  Components: Consistency/Coordination, Local/Memtable, Local/SSTable
>Reporter: Aleksey Yeschenko
>Assignee: Marcus Eriksson
>Priority: Normal
>
> In a mixed 2.X/3.X major version cluster a sequence of row deletes, 
> collection overwrites, paging, and read repair can cause 3.X nodes to split 
> individual rows into several rows with identical clustering. This happens due 
> to 2.X paging and RT semantics, and a 3.X {{LegacyLayout}} deficiency.
> To reproduce, set up a 2-node mixed major version cluster with the following 
> table:
> {code}
> CREATE TABLE distributed_test_keyspace.tlb (
> pk int,
> ck int,
> v map,
> PRIMARY KEY (pk, ck)
> );
> {code}
> 1. Using either node as the coordinator, delete the row with ck=2 using 
> timestamp 1
> {code}
> DELETE FROM tbl USING TIMESTAMP 1 WHERE pk = 1 AND ck = 2;
> {code}
> 2. Using either node as the coordinator, insert the following 3 rows:
> {code}
> INSERT INTO tbl (pk, ck, v) VALUES (1, 1, {'e':'f'}) USING TIMESTAMP 3;
> INSERT INTO tbl (pk, ck, v) VALUES (1, 2, {'g':'h'}) USING TIMESTAMP 3;
> INSERT INTO tbl (pk, ck, v) VALUES (1, 3, {'i':'j'}) USING TIMESTAMP 3;
> {code}
> 3. Flush the table on both nodes
> 4. Using the 2.2 node as the coordinator, force read repar by querying the 
> table with page size = 2:
>  
> {code}
> SELECT * FROM tbl;
> {code}
> 5. Overwrite the row with ck=2 using timestamp 5:
> {code}
> INSERT INTO tbl (pk, ck, v) VALUES (1, 2, {'g':'h'}) USING TIMESTAMP 5;}}
> {code}
> 6. Query the 3.0 node and observe the split row:
> {code}
> cqlsh> select * from distributed_test_keyspace.tlb ;
>  pk | ck | v
> ++
>   1 |  1 | {'e': 'f'}
>   1 |  2 | {'g': 'h'}
>   1 |  2 | {'k': 'l'}
>   1 |  3 | {'i': 'j'}
> {code}
> This happens because the read to query the second page ends up generating the 
> following mutation for the 3.0 node:
> {code}
> ColumnFamily(tbl -{deletedAt=-9223372036854775808, localDeletion=2147483647,
>  ranges=[2:v:_-2:v:!, deletedAt=2, localDeletion=1588588821]
> [2:v:!-2:!,   deletedAt=1, localDeletion=1588588821]
> [3:v:_-3:v:!, deletedAt=2, localDeletion=1588588821]}-
>  [2:v:63:false:1@3,])
> {code}
> Which on 3.0 side gets incorrectly deserialized as
> {code}
> Mutation(keyspace='distributed_test_keyspace', key='0001', modifications=[
>   [distributed_test_keyspace.tbl] key=1 
> partition_deletion=deletedAt=-9223372036854775808, localDeletion=2147483647 
> columns=[[] | [v]]
> Row[info=[ts=-9223372036854775808] ]: ck=2 | del(v)=deletedAt=2, 
> localDeletion=1588588821, [v[c]=d ts=3]
> Row[info=[ts=-9223372036854775808] del=deletedAt=1, 
> localDeletion=1588588821 ]: ck=2 |
> Row[info=[ts=-9223372036854775808] ]: ck=3 | del(v)=deletedAt=2, 
> localDeletion=1588588821
> ])
> {code}
> {{LegacyLayout}} correctly interprets a range tombstone whose start and 
> finish {{collectionName}} values don't match as a wrapping fragment of a 
> legacy row deletion that's being interrupted by a collection deletion 
> (correctly) - see 
> [code|https://github.com/apache/cassandra/blob/cassandra-3.0/src/java/org/apache/cassandra/db/LegacyLayout.java#L1874-L1889].
>  Quoting the comment inline:
> {code}
> // Because of the way RangeTombstoneList work, we can have a tombstone where 
> only one of
> // the bound has a collectionName. That happens if we have a big tombstone A 
> (spanning one
> // or multiple rows) and a collection tombstone B. In that case, 
> RangeTombstoneList will
> // split this into 3 RTs: the first one from the beginning of A to the 
> beginning of B,
> // then B, then a third one from the end of B to the end of A. To make this 
> simpler, if
>  // we detect that case we transform the 1st and 3rd tombstone so they don't 
> end in the middle
>  // of a row (which is still correct).
> {code}
> {{LegacyLayout#addRowTombstone()}} method then chokes when it encounters such 
> a tombstone in the middle of an existing row - having seen {{v[c]=d}} first, 
> and mistakenly starts a new row, while in the middle of an existing one: (see 
> [code|https://github.com/apache/cassandra/blob/cassandra-3.0/src/java/org/apache/cassandra/db/LegacyLayout.java#L1500-L1501]).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (CASSANDRA-15789) Rows can get duplicated in mixed major-version clusters and after full upgrade

2020-05-06 Thread Marcus Eriksson (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-15789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcus Eriksson updated CASSANDRA-15789:

Test and Documentation Plan: new tests and cci runs
 Status: Patch Available  (was: Open)

> Rows can get duplicated in mixed major-version clusters and after full upgrade
> --
>
> Key: CASSANDRA-15789
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15789
> Project: Cassandra
>  Issue Type: Bug
>  Components: Consistency/Coordination, Local/Memtable, Local/SSTable
>Reporter: Aleksey Yeschenko
>Assignee: Marcus Eriksson
>Priority: Normal
>
> In a mixed 2.X/3.X major version cluster a sequence of row deletes, 
> collection overwrites, paging, and read repair can cause 3.X nodes to split 
> individual rows into several rows with identical clustering. This happens due 
> to 2.X paging and RT semantics, and a 3.X {{LegacyLayout}} deficiency.
> To reproduce, set up a 2-node mixed major version cluster with the following 
> table:
> {code}
> CREATE TABLE distributed_test_keyspace.tlb (
> pk int,
> ck int,
> v map,
> PRIMARY KEY (pk, ck)
> );
> {code}
> 1. Using either node as the coordinator, delete the row with ck=2 using 
> timestamp 1
> {code}
> DELETE FROM tbl USING TIMESTAMP 1 WHERE pk = 1 AND ck = 2;
> {code}
> 2. Using either node as the coordinator, insert the following 3 rows:
> {code}
> INSERT INTO tbl (pk, ck, v) VALUES (1, 1, {'e':'f'}) USING TIMESTAMP 3;
> INSERT INTO tbl (pk, ck, v) VALUES (1, 2, {'g':'h'}) USING TIMESTAMP 3;
> INSERT INTO tbl (pk, ck, v) VALUES (1, 3, {'i':'j'}) USING TIMESTAMP 3;
> {code}
> 3. Flush the table on both nodes
> 4. Using the 2.2 node as the coordinator, force read repar by querying the 
> table with page size = 2:
>  
> {code}
> SELECT * FROM tbl;
> {code}
> 5. Overwrite the row with ck=2 using timestamp 5:
> {code}
> INSERT INTO tbl (pk, ck, v) VALUES (1, 2, {'g':'h'}) USING TIMESTAMP 5;}}
> {code}
> 6. Query the 3.0 node and observe the split row:
> {code}
> cqlsh> select * from distributed_test_keyspace.tlb ;
>  pk | ck | v
> ++
>   1 |  1 | {'e': 'f'}
>   1 |  2 | {'g': 'h'}
>   1 |  2 | {'k': 'l'}
>   1 |  3 | {'i': 'j'}
> {code}
> This happens because the read to query the second page ends up generating the 
> following mutation for the 3.0 node:
> {code}
> ColumnFamily(tbl -{deletedAt=-9223372036854775808, localDeletion=2147483647,
>  ranges=[2:v:_-2:v:!, deletedAt=2, localDeletion=1588588821]
> [2:v:!-2:!,   deletedAt=1, localDeletion=1588588821]
> [3:v:_-3:v:!, deletedAt=2, localDeletion=1588588821]}-
>  [2:v:63:false:1@3,])
> {code}
> Which on 3.0 side gets incorrectly deserialized as
> {code}
> Mutation(keyspace='distributed_test_keyspace', key='0001', modifications=[
>   [distributed_test_keyspace.tbl] key=1 
> partition_deletion=deletedAt=-9223372036854775808, localDeletion=2147483647 
> columns=[[] | [v]]
> Row[info=[ts=-9223372036854775808] ]: ck=2 | del(v)=deletedAt=2, 
> localDeletion=1588588821, [v[c]=d ts=3]
> Row[info=[ts=-9223372036854775808] del=deletedAt=1, 
> localDeletion=1588588821 ]: ck=2 |
> Row[info=[ts=-9223372036854775808] ]: ck=3 | del(v)=deletedAt=2, 
> localDeletion=1588588821
> ])
> {code}
> {{LegacyLayout}} correctly interprets a range tombstone whose start and 
> finish {{collectionName}} values don't match as a wrapping fragment of a 
> legacy row deletion that's being interrupted by a collection deletion 
> (correctly) - see 
> [code|https://github.com/apache/cassandra/blob/cassandra-3.0/src/java/org/apache/cassandra/db/LegacyLayout.java#L1874-L1889].
>  Quoting the comment inline:
> {code}
> // Because of the way RangeTombstoneList work, we can have a tombstone where 
> only one of
> // the bound has a collectionName. That happens if we have a big tombstone A 
> (spanning one
> // or multiple rows) and a collection tombstone B. In that case, 
> RangeTombstoneList will
> // split this into 3 RTs: the first one from the beginning of A to the 
> beginning of B,
> // then B, then a third one from the end of B to the end of A. To make this 
> simpler, if
>  // we detect that case we transform the 1st and 3rd tombstone so they don't 
> end in the middle
>  // of a row (which is still correct).
> {code}
> {{LegacyLayout#addRowTombstone()}} method then chokes when it encounters such 
> a tombstone in the middle of an existing row - having seen {{v[c]=d}} first, 
> and mistakenly starts a new row, while in the middle of an existing one: (see 
> [code|https://github.com/apache/cassandra/blob/cassandra-3.0/src/java/org/apache/cassandra/db/LegacyLayout.java#L1500-L1501]).



--
This message was sent by 

[jira] [Updated] (CASSANDRA-15789) Rows can get duplicated in mixed major-version clusters and after full upgrade

2020-05-05 Thread Marcus Eriksson (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-15789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcus Eriksson updated CASSANDRA-15789:

Authors: Aleksey Yeschenko, Marcus Eriksson, Sam Tunnicliffe  (was: Marcus 
Eriksson)

> Rows can get duplicated in mixed major-version clusters and after full upgrade
> --
>
> Key: CASSANDRA-15789
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15789
> Project: Cassandra
>  Issue Type: Bug
>  Components: Consistency/Coordination, Local/Memtable, Local/SSTable
>Reporter: Aleksey Yeschenko
>Assignee: Marcus Eriksson
>Priority: Normal
>
> In a mixed 2.X/3.X major version cluster a sequence of row deletes, 
> collection overwrites, paging, and read repair can cause 3.X nodes to split 
> individual rows into several rows with identical clustering. This happens due 
> to 2.X paging and RT semantics, and a 3.X {{LegacyLayout}} deficiency.
> To reproduce, set up a 2-node mixed major version cluster with the following 
> table:
> {code}
> CREATE TABLE distributed_test_keyspace.tlb (
> pk int,
> ck int,
> v map,
> PRIMARY KEY (pk, ck)
> );
> {code}
> 1. Using either node as the coordinator, delete the row with ck=2 using 
> timestamp 1
> {code}
> DELETE FROM tbl USING TIMESTAMP 1 WHERE pk = 1 AND ck = 2;
> {code}
> 2. Using either node as the coordinator, insert the following 3 rows:
> {code}
> INSERT INTO tbl (pk, ck, v) VALUES (1, 1, {'e':'f'}) USING TIMESTAMP 3;
> INSERT INTO tbl (pk, ck, v) VALUES (1, 2, {'g':'h'}) USING TIMESTAMP 3;
> INSERT INTO tbl (pk, ck, v) VALUES (1, 3, {'i':'j'}) USING TIMESTAMP 3;
> {code}
> 3. Flush the table on both nodes
> 4. Using the 2.2 node as the coordinator, force read repar by querying the 
> table with page size = 2:
>  
> {code}
> SELECT * FROM tbl;
> {code}
> 5. Overwrite the row with ck=2 using timestamp 5:
> {code}
> INSERT INTO tbl (pk, ck, v) VALUES (1, 2, {'g':'h'}) USING TIMESTAMP 5;}}
> {code}
> 6. Query the 3.0 node and observe the split row:
> {code}
> cqlsh> select * from distributed_test_keyspace.tlb ;
>  pk | ck | v
> ++
>   1 |  1 | {'e': 'f'}
>   1 |  2 | {'g': 'h'}
>   1 |  2 | {'k': 'l'}
>   1 |  3 | {'i': 'j'}
> {code}
> This happens because the read to query the second page ends up generating the 
> following mutation for the 3.0 node:
> {code}
> ColumnFamily(tbl -{deletedAt=-9223372036854775808, localDeletion=2147483647,
>  ranges=[2:v:_-2:v:!, deletedAt=2, localDeletion=1588588821]
> [2:v:!-2:!,   deletedAt=1, localDeletion=1588588821]
> [3:v:_-3:v:!, deletedAt=2, localDeletion=1588588821]}-
>  [2:v:63:false:1@3,])
> {code}
> Which on 3.0 side gets incorrectly deserialized as
> {code}
> Mutation(keyspace='distributed_test_keyspace', key='0001', modifications=[
>   [distributed_test_keyspace.tbl] key=1 
> partition_deletion=deletedAt=-9223372036854775808, localDeletion=2147483647 
> columns=[[] | [v]]
> Row[info=[ts=-9223372036854775808] ]: ck=2 | del(v)=deletedAt=2, 
> localDeletion=1588588821, [v[c]=d ts=3]
> Row[info=[ts=-9223372036854775808] del=deletedAt=1, 
> localDeletion=1588588821 ]: ck=2 |
> Row[info=[ts=-9223372036854775808] ]: ck=3 | del(v)=deletedAt=2, 
> localDeletion=1588588821
> ])
> {code}
> {{LegacyLayout}} correctly interprets a range tombstone whose start and 
> finish {{collectionName}} values don't match as a wrapping fragment of a 
> legacy row deletion that's being interrupted by a collection deletion 
> (correctly) - see 
> [code|https://github.com/apache/cassandra/blob/cassandra-3.0/src/java/org/apache/cassandra/db/LegacyLayout.java#L1874-L1889].
>  Quoting the comment inline:
> {code}
> // Because of the way RangeTombstoneList work, we can have a tombstone where 
> only one of
> // the bound has a collectionName. That happens if we have a big tombstone A 
> (spanning one
> // or multiple rows) and a collection tombstone B. In that case, 
> RangeTombstoneList will
> // split this into 3 RTs: the first one from the beginning of A to the 
> beginning of B,
> // then B, then a third one from the end of B to the end of A. To make this 
> simpler, if
>  // we detect that case we transform the 1st and 3rd tombstone so they don't 
> end in the middle
>  // of a row (which is still correct).
> {code}
> {{LegacyLayout#addRowTombstone()}} method then chokes when it encounters such 
> a tombstone in the middle of an existing row - having seen {{v[c]=d}} first, 
> and mistakenly starts a new row, while in the middle of an existing one: (see 
> [code|https://github.com/apache/cassandra/blob/cassandra-3.0/src/java/org/apache/cassandra/db/LegacyLayout.java#L1500-L1501]).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (CASSANDRA-15789) Rows can get duplicated in mixed major-version clusters and after full upgrade

2020-05-05 Thread Aleksey Yeschenko (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-15789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aleksey Yeschenko updated CASSANDRA-15789:
--
 Bug Category: Parent values: Correctness(12982)Level 1 values: Recoverable 
Corruption / Loss(12986)
   Complexity: Normal
  Component/s: Local/SSTable
   Local/Memtable
   Consistency/Coordination
Discovered By: Fuzz Test
Reviewers: Alex Petrov
 Severity: Critical
   Status: Open  (was: Triage Needed)

> Rows can get duplicated in mixed major-version clusters and after full upgrade
> --
>
> Key: CASSANDRA-15789
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15789
> Project: Cassandra
>  Issue Type: Bug
>  Components: Consistency/Coordination, Local/Memtable, Local/SSTable
>Reporter: Aleksey Yeschenko
>Assignee: Marcus Eriksson
>Priority: Normal
>
> In a mixed 2.X/3.X major version cluster a sequence of row deletes, 
> collection overwrites, paging, and read repair can cause 3.X nodes to split 
> individual rows into several rows with identical clustering. This happens due 
> to 2.X paging and RT semantics, and a 3.X {{LegacyLayout}} deficiency.
> To reproduce, set up a 2-node mixed major version cluster with the following 
> table:
> {code}
> CREATE TABLE distributed_test_keyspace.tlb (
> pk int,
> ck int,
> v map,
> PRIMARY KEY (pk, ck)
> );
> {code}
> 1. Using either node as the coordinator, delete the row with ck=2 using 
> timestamp 1
> {code}
> DELETE FROM tbl USING TIMESTAMP 1 WHERE pk = 1 AND ck = 2;
> {code}
> 2. Using either node as the coordinator, insert the following 3 rows:
> {code}
> INSERT INTO tbl (pk, ck, v) VALUES (1, 1, {'e':'f'}) USING TIMESTAMP 3;
> INSERT INTO tbl (pk, ck, v) VALUES (1, 2, {'g':'h'}) USING TIMESTAMP 3;
> INSERT INTO tbl (pk, ck, v) VALUES (1, 3, {'i':'j'}) USING TIMESTAMP 3;
> {code}
> 3. Flush the table on both nodes
> 4. Using the 2.2 node as the coordinator, force read repar by querying the 
> table with page size = 2:
>  
> {code}
> SELECT * FROM tbl;
> {code}
> 5. Overwrite the row with ck=2 using timestamp 5:
> {code}
> INSERT INTO tbl (pk, ck, v) VALUES (1, 2, {'g':'h'}) USING TIMESTAMP 5;}}
> {code}
> 6. Query the 3.0 node and observe the split row:
> {code}
> cqlsh> select * from distributed_test_keyspace.tlb ;
>  pk | ck | v
> ++
>   1 |  1 | {'e': 'f'}
>   1 |  2 | {'g': 'h'}
>   1 |  2 | {'k': 'l'}
>   1 |  3 | {'i': 'j'}
> {code}
> This happens because the read to query the second page ends up generating the 
> following mutation for the 3.0 node:
> {code}
> ColumnFamily(tbl -{deletedAt=-9223372036854775808, localDeletion=2147483647,
>  ranges=[2:v:_-2:v:!, deletedAt=2, localDeletion=1588588821]
> [2:v:!-2:!,   deletedAt=1, localDeletion=1588588821]
> [3:v:_-3:v:!, deletedAt=2, localDeletion=1588588821]}-
>  [2:v:63:false:1@3,])
> {code}
> Which on 3.0 side gets incorrectly deserialized as
> {code}
> Mutation(keyspace='distributed_test_keyspace', key='0001', modifications=[
>   [distributed_test_keyspace.tbl] key=1 
> partition_deletion=deletedAt=-9223372036854775808, localDeletion=2147483647 
> columns=[[] | [v]]
> Row[info=[ts=-9223372036854775808] ]: ck=2 | del(v)=deletedAt=2, 
> localDeletion=1588588821, [v[c]=d ts=3]
> Row[info=[ts=-9223372036854775808] del=deletedAt=1, 
> localDeletion=1588588821 ]: ck=2 |
> Row[info=[ts=-9223372036854775808] ]: ck=3 | del(v)=deletedAt=2, 
> localDeletion=1588588821
> ])
> {code}
> {{LegacyLayout}} correctly interprets a range tombstone whose start and 
> finish {{collectionName}} values don't match as a wrapping fragment of a 
> legacy row deletion that's being interrupted by a collection deletion 
> (correctly) - see 
> [code|https://github.com/apache/cassandra/blob/cassandra-3.0/src/java/org/apache/cassandra/db/LegacyLayout.java#L1874-L1889].
>  Quoting the comment inline:
> {code}
> // Because of the way RangeTombstoneList work, we can have a tombstone where 
> only one of
> // the bound has a collectionName. That happens if we have a big tombstone A 
> (spanning one
> // or multiple rows) and a collection tombstone B. In that case, 
> RangeTombstoneList will
> // split this into 3 RTs: the first one from the beginning of A to the 
> beginning of B,
> // then B, then a third one from the end of B to the end of A. To make this 
> simpler, if
>  // we detect that case we transform the 1st and 3rd tombstone so they don't 
> end in the middle
>  // of a row (which is still correct).
> {code}
> {{LegacyLayout#addRowTombstone()}} method then chokes when it encounters such 
> a tombstone in the middle of an existing row - having seen 

[jira] [Updated] (CASSANDRA-15789) Rows can get duplicated in mixed major-version clusters and after full upgrade

2020-05-05 Thread Aleksey Yeschenko (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-15789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aleksey Yeschenko updated CASSANDRA-15789:
--
Description: 
In a mixed 2.X/3.X major version cluster a sequence of row deletes, collection 
overwrites, paging, and read repair can cause 3.X nodes to split individual 
rows into several rows with identical clustering. This happens due to 2.X 
paging and RT semantics, and a 3.X {{LegacyLayout}} deficiency.

To reproduce, set up a 2-node mixed major version cluster with the following 
table:

{code}
CREATE TABLE distributed_test_keyspace.tlb (
pk int,
ck int,
v map,
PRIMARY KEY (pk, ck)
);
{code}

1. Using either node as the coordinator, delete the row with ck=2 using 
timestamp 1

{code}
DELETE FROM tbl USING TIMESTAMP 1 WHERE pk = 1 AND ck = 2;
{code}

2. Using either node as the coordinator, insert the following 3 rows:

{code}
INSERT INTO tbl (pk, ck, v) VALUES (1, 1, {'e':'f'}) USING TIMESTAMP 3;
INSERT INTO tbl (pk, ck, v) VALUES (1, 2, {'g':'h'}) USING TIMESTAMP 3;
INSERT INTO tbl (pk, ck, v) VALUES (1, 3, {'i':'j'}) USING TIMESTAMP 3;
{code}

3. Flush the table on both nodes

4. Using the 2.2 node as the coordinator, force read repar by querying the 
table with page size = 2:
 
{code}
SELECT * FROM tbl;
{code}

5. Overwrite the row with ck=2 using timestamp 5:

{code}
INSERT INTO tbl (pk, ck, v) VALUES (1, 2, {'g':'h'}) USING TIMESTAMP 5;}}
{code}

6. Query the 3.0 node and observe the split row:

{code}
cqlsh> select * from distributed_test_keyspace.tlb ;

 pk | ck | v
++
  1 |  1 | {'e': 'f'}
  1 |  2 | {'g': 'h'}
  1 |  2 | {'k': 'l'}
  1 |  3 | {'i': 'j'}
{code}

This happens because the read to query the second page ends up generating the 
following mutation for the 3.0 node:

{code}
ColumnFamily(tbl -{deletedAt=-9223372036854775808, localDeletion=2147483647,
 ranges=[2:v:_-2:v:!, deletedAt=2, localDeletion=1588588821]
[2:v:!-2:!,   deletedAt=1, localDeletion=1588588821]
[3:v:_-3:v:!, deletedAt=2, localDeletion=1588588821]}-
 [2:v:63:false:1@3,])
{code}

Which on 3.0 side gets incorrectly deserialized as

{code}
Mutation(keyspace='distributed_test_keyspace', key='0001', modifications=[
  [distributed_test_keyspace.tbl] key=1 
partition_deletion=deletedAt=-9223372036854775808, localDeletion=2147483647 
columns=[[] | [v]]
Row[info=[ts=-9223372036854775808] ]: ck=2 | del(v)=deletedAt=2, 
localDeletion=1588588821, [v[c]=d ts=3]
Row[info=[ts=-9223372036854775808] del=deletedAt=1, 
localDeletion=1588588821 ]: ck=2 |
Row[info=[ts=-9223372036854775808] ]: ck=3 | del(v)=deletedAt=2, 
localDeletion=1588588821
])
{code}

{{LegacyLayout}} correctly interprets a range tombstone whose start and finish 
{{collectionName}} values don't match as a wrapping fragment of a legacy row 
deletion that's being interrupted by a collection deletion (correctly) - see 
[code|https://github.com/apache/cassandra/blob/cassandra-3.0/src/java/org/apache/cassandra/db/LegacyLayout.java#L1874-L1889].
 Quoting the comment inline:

{code}
// Because of the way RangeTombstoneList work, we can have a tombstone where 
only one of
// the bound has a collectionName. That happens if we have a big tombstone A 
(spanning one
// or multiple rows) and a collection tombstone B. In that case, 
RangeTombstoneList will
// split this into 3 RTs: the first one from the beginning of A to the 
beginning of B,
// then B, then a third one from the end of B to the end of A. To make this 
simpler, if
 // we detect that case we transform the 1st and 3rd tombstone so they don't 
end in the middle
 // of a row (which is still correct).
{code}

{{LegacyLayout#addRowTombstone()}} method then chokes when it encounters such a 
tombstone in the middle of an existing row - having seen {{v[c]=d}} first, and 
mistakenly starts a new row, while in the middle of an existing one: (see 
[code|https://github.com/apache/cassandra/blob/cassandra-3.0/src/java/org/apache/cassandra/db/LegacyLayout.java#L1500-L1501]).


> Rows can get duplicated in mixed major-version clusters and after full upgrade
> --
>
> Key: CASSANDRA-15789
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15789
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Aleksey Yeschenko
>Assignee: Marcus Eriksson
>Priority: Normal
>
> In a mixed 2.X/3.X major version cluster a sequence of row deletes, 
> collection overwrites, paging, and read repair can cause 3.X nodes to split 
> individual rows into several rows with identical clustering. This happens due 
> to 2.X paging and RT semantics, and a 3.X {{LegacyLayout}} deficiency.
> To reproduce, set up a 2-node mixed major version cluster with the following 

[jira] [Updated] (CASSANDRA-15789) Rows can get duplicated in mixed major-version clusters and after full upgrade

2020-05-05 Thread Aleksey Yeschenko (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-15789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aleksey Yeschenko updated CASSANDRA-15789:
--
Summary: Rows can get duplicated in mixed major-version clusters and after 
full upgrade  (was: TBD)

> Rows can get duplicated in mixed major-version clusters and after full upgrade
> --
>
> Key: CASSANDRA-15789
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15789
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Aleksey Yeschenko
>Assignee: Marcus Eriksson
>Priority: Normal
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org