Jaydeepkumar Chovatia created CASSANDRA-17991:
-------------------------------------------------
Summary: Possible data inconsistency during bootstrap/decommission
Key: CASSANDRA-17991
URL: https://issues.apache.org/jira/browse/CASSANDRA-17991
Project: Cassandra
Issue Type: Bug
Components: Consistency/Bootstrap and Decommission
Reporter: Jaydeepkumar Chovatia
I am facing a corner case in which deleted data resurrects.
tl;dr: When all the SSTables for a given token range are streamed to the new
owner, they are not sent atomically, so the new owner can compact the
partially received SSTables, which may purge tombstones prematurely.
Here are the reproducible steps:
+*Prerequisite*+
# A three-node Cassandra cluster: n1, n2, and n3
# The following schema:
{code:java}
CREATE KEYSPACE KS1 WITH replication = {'class': 'NetworkTopologyStrategy',
'dc1': '3'};

CREATE TABLE KS1.T1 (
    key int,
    c1 int,
    c2 int,
    c3 int,
    PRIMARY KEY (key)
) WITH compaction = {'class':
'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy',
'max_threshold': '32', 'min_threshold': '4'}
AND gc_grace_seconds = 864000;
{code}
*Reproducible Steps*
* Day1: Insert a new record
{code:java}
INSERT INTO KS1.T1 (key, c1, c2, c3) values (1, 10, 20, 30){code}
* Day2: Other records are inserted into this table, and it goes through
multiple compactions
* Day3: Here is the data layout on SSTables on n1, n2, and n3
{code:java}
SSTable1:
{
  "partition" : {
    "key" : [ "1" ],
    "position" : 900
  },
  "rows" : [
    {
      "type" : "row",
      "position" : 10,
      "liveness_info" : { "tstamp" : "2022-10-15T00:00:00.000001Z" },
      "cells" : [
        { "name" : "c1", "value" : 10 },
        { "name" : "c2", "value" : 20 },
        { "name" : "c3", "value" : 30 }
      ]
    }
  ]
}
.....
SSTable2:
{
  "partition" : {
    "key" : [ "1" ],
    "position" : 900
  },
  "rows" : [
    {
      "type" : "row",
      "position" : 10,
      "liveness_info" : { "tstamp" : "2022-10-15T00:00:00.000001Z" },
      "cells" : [
        { "name" : "c1", "value" : 10 },
        { "name" : "c2", "value" : 20 },
        { "name" : "c3", "value" : 30 }
      ]
    }
  ]
}
{code}
* Day4: Delete the record
{code:java}
CONSISTENCY ALL; DELETE FROM KS1.T1 WHERE key = 1; {code}
* Day5: Here is the data layout on SSTables on n1, n2, and n3
{code:java}
SSTable1:
{
  "partition" : {
    "key" : [ "1" ],
    "position" : 900
  },
  "rows" : [
    {
      "type" : "row",
      "position" : 10,
      "liveness_info" : { "tstamp" : "2022-10-15T00:00:00.000001Z" },
      "cells" : [
        { "name" : "c1", "value" : 10 },
        { "name" : "c2", "value" : 20 },
        { "name" : "c3", "value" : 30 }
      ]
    }
  ]
}
.....
SSTable2:
{
  "partition" : {
    "key" : [ "1" ],
    "position" : 900
  },
  "rows" : [
    {
      "type" : "row",
      "position" : 10,
      "liveness_info" : { "tstamp" : "2022-10-15T00:00:00.000001Z" },
      "cells" : [
        { "name" : "c1", "value" : 10 },
        { "name" : "c2", "value" : 20 },
        { "name" : "c3", "value" : 30 }
      ]
    }
  ]
}
.....
SSTable3 (Tombstone):
{
  "partition" : {
    "key" : [ "1" ],
    "position" : 900
  },
  "rows" : [
    {
      "type" : "row",
      "position" : 10,
      "deletion_info" : { "marked_deleted" : "2022-10-19T00:00:00.000001Z",
        "local_delete_time" : "2022-10-19T00:00:00.000001Z" },
      "cells" : [ ]
    }
  ]
}
{code}
* Day20: Nothing happens for more than 10 days, i.e., longer than
gc_grace_seconds. Let's say the data layout on SSTables on n1, n2, and n3 is
the same as on Day5
* Day20: A new node (n4) joins the ring and becomes responsible for key "1".
Let's say it streams the data from n3. The node _n3_ is supposed to stream
out SSTable1, SSTable2, and SSTable3, but per the streaming algorithm this
does not happen atomically. Consider a scenario in which _n4_ has received
SSTable1 and SSTable3, but not yet SSTable2, and _n4_ compacts SSTable1 and
SSTable3 together. Because the tombstone in SSTable3 is older than
gc_grace_seconds, _n4_ purges key "1" entirely, leaving no trace of it on
{_}n4{_}. Some time later, SSTable2 is streamed in, and it delivers key "1"
as live data again.
* Day20: _n4_ becomes normal
{code:java}
Query on n4:
$> CONSISTENCY LOCAL_ONE; SELECT * FROM KS1.T1 WHERE key = 1;
// 1 | 10 | 20 | 30 <-- A record is returned
Query on n1:
$> CONSISTENCY LOCAL_ONE; SELECT * FROM KS1.T1 WHERE key = 1;
// <empty> //no output{code}
Does this make sense?
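The failure mode above can be sketched with a toy model. This is a minimal
simulation (assumption: the {{compact}} helper and the data layout below are
illustrative, not Cassandra's actual compaction code) showing how compacting
SSTable1 and SSTable3 without SSTable2 purges the expired tombstone, after
which the late-arriving SSTable2 resurrects the row:
{code:python}
GC_GRACE_SECONDS = 864000  # 10 days, as in the table definition

# Each SSTable is modeled as {key: (timestamp, value)}; value None is a
# tombstone marker for that key.
def compact(sstables, now):
    """Merge SSTables, keep the newest cell per key, and drop tombstones
    whose gc_grace window has expired (as compaction is allowed to do)."""
    merged = {}
    for table in sstables:
        for key, (ts, value) in table.items():
            if key not in merged or ts > merged[key][0]:
                merged[key] = (ts, value)
    return {k: (ts, v) for k, (ts, v) in merged.items()
            if not (v is None and now - ts > GC_GRACE_SECONDS)}

DAY = 86400
sstable1 = {1: (1 * DAY, (10, 20, 30))}  # Day 1 insert
sstable2 = {1: (1 * DAY, (10, 20, 30))}  # same row in a second SSTable
sstable3 = {1: (4 * DAY, None)}          # Day 4 tombstone

now = 20 * DAY  # Day 20: the tombstone is older than gc_grace_seconds

# n4 has received only SSTable1 and SSTable3 and compacts them:
partial = compact([sstable1, sstable3], now)
assert 1 not in partial  # key "1" and its tombstone are both purged

# SSTable2 arrives afterwards; the next compaction resurrects the row:
final = compact([partial, sstable2], now)
print(final)  # the deleted row is live again
{code}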
*Possible Solution*
* One possible solution is to not purge tombstones while there are pending
token range movements in the ring
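As a hedged sketch of that idea (assumption: {{purge_allowed}} and
{{pending_movements}} are illustrative names; a real fix would hook into
Cassandra's compaction path and ring state, which this does not model):
{code:python}
def purge_allowed(tombstone_ts, now, gc_grace, pending_movements):
    """A tombstone may only be dropped when its gc_grace window has
    expired AND no token-range movement (bootstrap/decommission/move)
    is in flight anywhere in the ring."""
    return (now - tombstone_ts > gc_grace) and not pending_movements

DAY = 86400
# Day-4 tombstone, checked on Day 20 with gc_grace_seconds = 864000:
print(purge_allowed(4 * DAY, 20 * DAY, 864000, ["n4 bootstrapping"]))  # False
print(purge_allowed(4 * DAY, 20 * DAY, 864000, []))                    # True
{code}
With such a guard, the Day-20 compaction on _n4_ would keep the SSTable3
tombstone until its bootstrap completes, so the late-arriving SSTable2 could
not resurrect key "1".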
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]