[ https://issues.apache.org/jira/browse/CASSANDRA-17991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17625960#comment-17625960 ]
Jeff Jirsa commented on CASSANDRA-17991:
----------------------------------------

> Let's consider a scenario in which _n4_ receives SSTable1 and SSTable3, but
> not yet SSTable2, and _n4_ compacts SSTable1 and SSTable3. In this case, _n4_
> would purge the key "1". So at this time, there are no traces of key "1" on
> {_}n4{_}. After some time, SSTable2 is streamed, and at this time it will
> stream the key "1" as well.

It should be atomic; nothing should be allowed to compact until the streaming is done. In what situation are you seeing this be non-atomic?


> Possible data inconsistency during bootstrap/decommission
> ---------------------------------------------------------
>
>                 Key: CASSANDRA-17991
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-17991
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Consistency/Bootstrap and Decommission
>            Reporter: Jaydeepkumar Chovatia
>            Priority: Normal
>
> I am facing a corner case in which deleted data resurrects.
> tl;dr: This could be because when we stream all the SSTables for a given
> token range to the new owner, they are not sent atomically, so the new
> owner could run compaction on the partially received SSTables, which might
> remove the tombstones.
>
> Here are the reproducible steps:
>
> +*Prerequisite*+
> # Three-node Cassandra cluster: n1, n2, and n3 (C* version 3.0.27)
> #
> {code:java}
> CREATE KEYSPACE KS1 WITH replication = {'class': 'NetworkTopologyStrategy',
>     'dc1': '3'};
>
> CREATE TABLE KS1.T1 (
>     key int,
>     c1 int,
>     c2 int,
>     c3 int,
>     PRIMARY KEY (key)
> ) WITH compaction = {'class':
>     'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy',
>     'max_threshold': '32', 'min_threshold': '4'}
>   AND gc_grace_seconds = 864000;
> {code}
>
> *Reproducible Steps*
> * Day1: Insert a new record followed by _nodetool flush_ on n1, n2, and
> n3. A new SSTable ({_}SSTable1{_}) will be created.
> {code:java}
> INSERT INTO KS1.T1 (key, c1, c2, c3) VALUES (1, 10, 20, 30);{code}
> * Day2: Insert the same record again followed by _nodetool flush_ on n1,
> n2, and n3. A new SSTable ({_}SSTable2{_}) will be created.
> {code:java}
> INSERT INTO KS1.T1 (key, c1, c2, c3) VALUES (1, 10, 20, 30);{code}
> * Day3: Here is the data layout on SSTables on n1, n2, and n3
> {code:java}
> SSTable1:
> {
>   "partition" : {
>     "key" : [ "1" ],
>     "position" : 900
>   },
>   "rows" : [
>     {
>       "type" : "row",
>       "position" : 10,
>       "liveness_info" : { "tstamp" : "2022-10-15T00:00:00.000001Z" },
>       "cells" : [
>         { "name" : "c1", "value" : 10 },
>         { "name" : "c2", "value" : 20 },
>         { "name" : "c3", "value" : 30 }
>       ]
>     }
>   ]
> }
> .....
> SSTable2:
> {
>   "partition" : {
>     "key" : [ "1" ],
>     "position" : 900
>   },
>   "rows" : [
>     {
>       "type" : "row",
>       "position" : 10,
>       "liveness_info" : { "tstamp" : "2022-10-16T00:00:00.000001Z" },
>       "cells" : [
>         { "name" : "c1", "value" : 10 },
>         { "name" : "c2", "value" : 20 },
>         { "name" : "c3", "value" : 30 }
>       ]
>     }
>   ]
> }
> {code}
> * Day4: Delete the record followed by _nodetool flush_ on n1, n2, and n3
> {code:java}
> CONSISTENCY ALL;
> DELETE FROM KS1.T1 WHERE key = 1;{code}
> * Day5: Here is the data layout on SSTables on n1, n2, and n3
> {code:java}
> SSTable1:
> {
>   "partition" : {
>     "key" : [ "1" ],
>     "position" : 900
>   },
>   "rows" : [
>     {
>       "type" : "row",
>       "position" : 10,
>       "liveness_info" : { "tstamp" : "2022-10-15T00:00:00.000001Z" },
>       "cells" : [
>         { "name" : "c1", "value" : 10 },
>         { "name" : "c2", "value" : 20 },
>         { "name" : "c3", "value" : 30 }
>       ]
>     }
>   ]
> }
> .....
> SSTable2:
> {
>   "partition" : {
>     "key" : [ "1" ],
>     "position" : 900
>   },
>   "rows" : [
>     {
>       "type" : "row",
>       "position" : 10,
>       "liveness_info" : { "tstamp" : "2022-10-16T00:00:00.000001Z" },
>       "cells" : [
>         { "name" : "c1", "value" : 10 },
>         { "name" : "c2", "value" : 20 },
>         { "name" : "c3", "value" : 30 }
>       ]
>     }
>   ]
> }
> .....
> SSTable3 (Tombstone):
> {
>   "partition" : {
>     "key" : [ "1" ],
>     "position" : 900
>   },
>   "rows" : [
>     {
>       "type" : "row",
>       "position" : 10,
>       "deletion_info" : { "marked_deleted" : "2022-10-19T00:00:00.000001Z",
>                           "local_delete_time" : "2022-10-19T00:00:00.000001Z" },
>       "cells" : [ ]
>     }
>   ]
> }
> {code}
> * Day20: Nothing happens for more than 10 days. Let's say the data layout on
> SSTables on n1, n2, and n3 is the same as on Day5.
> * Day20: A new node (n4) joins the ring, and it is going to be responsible
> for key "1". Let's say it streams the data from n3. The node _n3_ is supposed
> to stream out SSTable1, SSTable2, and SSTable3, but this does not happen
> atomically as per the streaming algorithm. Let's consider a scenario in which
> _n4_ receives SSTable1 and SSTable3, but not yet SSTable2, and _n4_ compacts
> SSTable1 and SSTable3. In this case, _n4_ would purge the key "1". So at this
> point, there are no traces of key "1" on {_}n4{_}. After some time, SSTable2
> is streamed, and at that time it streams the key "1" as well.
> * Day20: _n4_ becomes normal
> {code:java}
> Query on n4:
> $> CONSISTENCY LOCAL_ONE; SELECT * FROM KS1.T1 WHERE key = 1;
> // 1 | 10 | 20 | 30  <-- A record is returned
>
> Query on n1:
> $> CONSISTENCY LOCAL_ONE; SELECT * FROM KS1.T1 WHERE key = 1;
> // <empty> // no output{code}
>
> Does this make sense?
>
> *Possible Solution*
> * One of the solutions could be to not purge tombstones while there are
> token range movements in the ring



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org
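The purge-then-resurrect sequence in the reproducible steps above can be sketched with a toy model. This is hypothetical illustration code, not Cassandra's actual compaction implementation: an "SSTable" is modeled as a dict mapping a key to a (timestamp, value) pair, with value=None standing in for a tombstone, and `compact` keeps only the newest write per key, dropping gc-expired tombstones.

```python
# Toy model (hypothetical; NOT Cassandra's real code) of why compacting a
# partial set of streamed SSTables can resurrect deleted data.
# An "SSTable" maps key -> (timestamp, value); value=None marks a tombstone.

DAY1, DAY2, DAY4 = 1, 2, 4  # logical write timestamps

sstable1 = {"1": (DAY1, (10, 20, 30))}  # Day1 insert
sstable2 = {"1": (DAY2, (10, 20, 30))}  # Day2 re-insert
sstable3 = {"1": (DAY4, None)}          # Day4 delete (tombstone)


def compact(sstables, gc_grace_elapsed=True):
    """Keep the newest write per key; if gc_grace has elapsed, drop keys
    whose winning write is a tombstone. Dropping a tombstone is only safe
    when every SSTable that may contain the key takes part in the
    compaction."""
    merged = {}
    for table in sstables:
        for key, (ts, value) in table.items():
            if key not in merged or ts > merged[key][0]:
                merged[key] = (ts, value)
    if gc_grace_elapsed:
        merged = {k: v for k, v in merged.items() if v[1] is not None}
    return merged


# n4 has received only SSTable1 and SSTable3 and compacts them: the tombstone
# shadows the Day1 data, and both are purged (gc_grace_seconds has passed).
on_n4 = compact([sstable1, sstable3])
assert on_n4 == {}  # no trace of key "1" on n4

# SSTable2 arrives later; a read over what n4 now holds sees the Day2 write
# with no tombstone left to shadow it -- the deleted row is resurrected.
visible = compact([on_n4, sstable2], gc_grace_elapsed=False)
print(visible)  # {'1': (2, (10, 20, 30))}

# Had all three SSTables been compacted together, the tombstone would have
# shadowed both inserts and the key would, correctly, stay gone:
assert compact([sstable1, sstable2, sstable3]) == {}
```

In this model, the proposed fix (do not purge tombstones while token ranges are moving) corresponds to forcing `gc_grace_elapsed=False` on the receiving node until streaming for the range has completed.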