[ https://issues.apache.org/jira/browse/CASSANDRA-17991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17625960#comment-17625960 ]
Jeff Jirsa commented on CASSANDRA-17991:
----------------------------------------

> Let's consider a scenario in which _n4_ receives SSTable1 and SSTable3, but
> not yet SSTable2, and _n4_ compacts SSTable1 and SSTable3. In this case, _n4_
> would purge the key "1". So at this time, there are no traces of key "1" on
> {_}n4{_}. After some time, SSTable2 is streamed, and at this time it will
> stream the key "1" as well.

It should be atomic; nothing should be allowed to compact until the streaming is done. In what situation are you seeing this be non-atomic?


> Possible data inconsistency during bootstrap/decommission
> ---------------------------------------------------------
>
>                 Key: CASSANDRA-17991
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-17991
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Consistency/Bootstrap and Decommission
>            Reporter: Jaydeepkumar Chovatia
>            Priority: Normal
>
> I am facing a corner case in which deleted data resurrects.
> tl;dr: This could be because when we stream all the SSTables for a given
> token range to the new owner, they are not sent atomically, so the new
> owner could run compaction on the partially received SSTables, which might
> remove the tombstones.
>
> Here are the reproducible steps:
>
> +*Prerequisite*+
> # Three-node Cassandra cluster: n1, n2, and n3 (C* version 3.0.27)
> #
> {code:java}
> CREATE KEYSPACE KS1 WITH replication = {'class': 'NetworkTopologyStrategy',
>     'dc1': '3'};
>
> CREATE TABLE KS1.T1 (
>     key int,
>     c1 int,
>     c2 int,
>     c3 int,
>     PRIMARY KEY (key)
> ) WITH compaction = {'class':
>     'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy',
>     'max_threshold': '32', 'min_threshold': '4'}
>   AND gc_grace_seconds = 864000;
> {code}
>
> *Reproducible Steps*
> * Day1: Insert a new record followed by _nodetool flush_ on n1, n2, and
> n3. A new SSTable ({_}SSTable1{_}) will be created.
> {code:java}
> INSERT INTO KS1.T1 (key, c1, c2, c3) VALUES (1, 10, 20, 30);{code}
> * Day2: Insert the same record again followed by _nodetool flush_ on n1,
> n2, and n3. A new SSTable ({_}SSTable2{_}) will be created.
> {code:java}
> INSERT INTO KS1.T1 (key, c1, c2, c3) VALUES (1, 10, 20, 30);{code}
> * Day3: Here is the data layout on SSTables on n1, n2, and n3
> {code:java}
> SSTable1:
> {
>   "partition" : {
>     "key" : [ "1" ],
>     "position" : 900
>   },
>   "rows" : [
>     {
>       "type" : "row",
>       "position" : 10,
>       "liveness_info" : { "tstamp" : "2022-10-15T00:00:00.000001Z" },
>       "cells" : [
>         { "name" : "c1", "value" : 10 },
>         { "name" : "c2", "value" : 20 },
>         { "name" : "c3", "value" : 30 }
>       ]
>     }
>   ]
> }
> .....
> SSTable2:
> {
>   "partition" : {
>     "key" : [ "1" ],
>     "position" : 900
>   },
>   "rows" : [
>     {
>       "type" : "row",
>       "position" : 10,
>       "liveness_info" : { "tstamp" : "2022-10-16T00:00:00.000001Z" },
>       "cells" : [
>         { "name" : "c1", "value" : 10 },
>         { "name" : "c2", "value" : 20 },
>         { "name" : "c3", "value" : 30 }
>       ]
>     }
>   ]
> }
> {code}
> * Day4: Delete the record followed by _nodetool flush_ on n1, n2, and n3
> {code:java}
> CONSISTENCY ALL;
> DELETE FROM KS1.T1 WHERE key = 1;{code}
> * Day5: Here is the data layout on SSTables on n1, n2, and n3
> {code:java}
> SSTable1:
> {
>   "partition" : {
>     "key" : [ "1" ],
>     "position" : 900
>   },
>   "rows" : [
>     {
>       "type" : "row",
>       "position" : 10,
>       "liveness_info" : { "tstamp" : "2022-10-15T00:00:00.000001Z" },
>       "cells" : [
>         { "name" : "c1", "value" : 10 },
>         { "name" : "c2", "value" : 20 },
>         { "name" : "c3", "value" : 30 }
>       ]
>     }
>   ]
> }
> .....
> SSTable2:
> {
>   "partition" : {
>     "key" : [ "1" ],
>     "position" : 900
>   },
>   "rows" : [
>     {
>       "type" : "row",
>       "position" : 10,
>       "liveness_info" : { "tstamp" : "2022-10-16T00:00:00.000001Z" },
>       "cells" : [
>         { "name" : "c1", "value" : 10 },
>         { "name" : "c2", "value" : 20 },
>         { "name" : "c3", "value" : 30 }
>       ]
>     }
>   ]
> }
> .....
> SSTable3 (Tombstone):
> {
>   "partition" : {
>     "key" : [ "1" ],
>     "position" : 900
>   },
>   "rows" : [
>     {
>       "type" : "row",
>       "position" : 10,
>       "deletion_info" : { "marked_deleted" : "2022-10-19T00:00:00.000001Z",
>                           "local_delete_time" : "2022-10-19T00:00:00.000001Z" },
>       "cells" : [ ]
>     }
>   ]
> }
> {code}
> * Day20: Nothing happens for more than 10 days. Let's say the data layout on
> SSTables on n1, n2, and n3 is the same as on Day5.
> * Day20: A new node (n4) joins the ring, and it is going to be responsible
> for key "1". Let's say it streams the data from n3. The node _n3_ is supposed
> to stream out SSTable1, SSTable2, and SSTable3, but this does not happen
> atomically as per the streaming algorithm. Let's consider a scenario in which
> _n4_ receives SSTable1 and SSTable3, but not yet SSTable2, and _n4_ compacts
> SSTable1 and SSTable3. In this case, _n4_ would purge the key "1". So at this
> point, there are no traces of key "1" on {_}n4{_}. After some time, SSTable2
> is streamed, and at that time it streams the key "1" as well.
> * Day20: _n4_ becomes normal
> {code:java}
> Query on n4:
> $> CONSISTENCY LOCAL_ONE; SELECT * FROM KS1.T1 WHERE key = 1;
> // 1 | 10 | 20 | 30  <-- A record is returned
>
> Query on n1:
> $> CONSISTENCY LOCAL_ONE; SELECT * FROM KS1.T1 WHERE key = 1;
> // <empty> // no output{code}
>
> Does this make sense?
>
> *Possible Solution*
> * One of the solutions could be to not purge tombstones while there are
> token range movements in the ring



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org
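The purge-then-resurrect sequence in the reproducible steps above can be sketched with a toy model. This is hypothetical illustration code, not Cassandra's actual compaction implementation: an "SSTable" is modeled as a dict mapping a key to a (timestamp, value) pair, with value=None standing in for a tombstone, and `compact` keeps only the newest write per key, dropping gc-expired tombstones.

```python
# Toy model (hypothetical; NOT Cassandra's real code) of why compacting a
# partial set of streamed SSTables can resurrect deleted data.
# An "SSTable" maps key -> (timestamp, value); value=None marks a tombstone.

DAY1, DAY2, DAY4 = 1, 2, 4  # logical write timestamps

sstable1 = {"1": (DAY1, (10, 20, 30))}  # Day1 insert
sstable2 = {"1": (DAY2, (10, 20, 30))}  # Day2 re-insert
sstable3 = {"1": (DAY4, None)}          # Day4 delete (tombstone)


def compact(sstables, gc_grace_elapsed=True):
    """Keep the newest write per key; if gc_grace has elapsed, drop keys
    whose winning write is a tombstone. Dropping a tombstone is only safe
    when every SSTable that may contain the key takes part in the
    compaction."""
    merged = {}
    for table in sstables:
        for key, (ts, value) in table.items():
            if key not in merged or ts > merged[key][0]:
                merged[key] = (ts, value)
    if gc_grace_elapsed:
        merged = {k: v for k, v in merged.items() if v[1] is not None}
    return merged


# n4 has received only SSTable1 and SSTable3 and compacts them: the tombstone
# shadows the Day1 data, and both are purged (gc_grace_seconds has passed).
on_n4 = compact([sstable1, sstable3])
assert on_n4 == {}  # no trace of key "1" on n4

# SSTable2 arrives later; a read over what n4 now holds sees the Day2 write
# with no tombstone left to shadow it -- the deleted row is resurrected.
visible = compact([on_n4, sstable2], gc_grace_elapsed=False)
print(visible)  # {'1': (2, (10, 20, 30))}

# Had all three SSTables been compacted together, the tombstone would have
# shadowed both inserts and the key would, correctly, stay gone:
assert compact([sstable1, sstable2, sstable3]) == {}
```

In this model, the proposed fix (do not purge tombstones while token ranges are moving) corresponds to forcing `gc_grace_elapsed=False` on the receiving node until streaming for the range has completed.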