Jaydeepkumar Chovatia created CASSANDRA-17991:
-------------------------------------------------
Summary: Possible data inconsistency during bootstrap/decommission
Key: CASSANDRA-17991
URL: https://issues.apache.org/jira/browse/CASSANDRA-17991
Project: Cassandra
Issue Type: Bug
Components: Consistency/Bootstrap and Decommission
Reporter: Jaydeepkumar Chovatia
I am facing a corner case in which deleted data resurrects.
tl;dr: When all the SSTables for a given token range are streamed to the new
owner, they are not sent atomically, so the new owner can compact the
partially received SSTables, which may purge tombstones prematurely.
Here are the reproducible steps:
+*Prerequisite*+
# A three-node Cassandra cluster: n1, n2, and n3
# The following schema:
{code:java}
CREATE KEYSPACE KS1 WITH replication = {'class': 'NetworkTopologyStrategy',
'dc1': '3'};

CREATE TABLE KS1.T1 (
    key int,
    c1 int,
    c2 int,
    c3 int,
    PRIMARY KEY (key)
) WITH compaction = {'class':
'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy',
'max_threshold': '32', 'min_threshold': '4'}
AND gc_grace_seconds = 864000;
{code}
*Reproducible Steps*
* Day1: Insert a new record
{code:java}
INSERT INTO KS1.T1 (key, c1, c2, c3) values (1, 10, 20, 30){code}
* Day2: Other records are inserted into this table, and it goes through
multiple compactions
* Day3: Here is the data layout on SSTables on n1, n2, and n3
{code:java}
SSTable1:
{
  "partition" : {
    "key" : [ "1" ],
    "position" : 900
  },
  "rows" : [
    {
      "type" : "row",
      "position" : 10,
      "liveness_info" : { "tstamp" : "2022-10-15T00:00:00.000001Z" },
      "cells" : [
        { "name" : "c1", "value" : 10 },
        { "name" : "c2", "value" : 20 },
        { "name" : "c3", "value" : 30 }
      ]
    }
  ]
}
.....
SSTable2:
{
  "partition" : {
    "key" : [ "1" ],
    "position" : 900
  },
  "rows" : [
    {
      "type" : "row",
      "position" : 10,
      "liveness_info" : { "tstamp" : "2022-10-15T00:00:00.000001Z" },
      "cells" : [
        { "name" : "c1", "value" : 10 },
        { "name" : "c2", "value" : 20 },
        { "name" : "c3", "value" : 30 }
      ]
    }
  ]
}
{code}
* Day4: Delete the record
{code:java}
CONSISTENCY ALL; DELETE FROM KS1.T1 WHERE key = 1; {code}
* Day5: Here is the data layout on SSTables on n1, n2, and n3
{code:java}
SSTable1:
{
  "partition" : {
    "key" : [ "1" ],
    "position" : 900
  },
  "rows" : [
    {
      "type" : "row",
      "position" : 10,
      "liveness_info" : { "tstamp" : "2022-10-15T00:00:00.000001Z" },
      "cells" : [
        { "name" : "c1", "value" : 10 },
        { "name" : "c2", "value" : 20 },
        { "name" : "c3", "value" : 30 }
      ]
    }
  ]
}
.....
SSTable2:
{
  "partition" : {
    "key" : [ "1" ],
    "position" : 900
  },
  "rows" : [
    {
      "type" : "row",
      "position" : 10,
      "liveness_info" : { "tstamp" : "2022-10-15T00:00:00.000001Z" },
      "cells" : [
        { "name" : "c1", "value" : 10 },
        { "name" : "c2", "value" : 20 },
        { "name" : "c3", "value" : 30 }
      ]
    }
  ]
}
.....
SSTable3 (Tombstone):
{
  "partition" : {
    "key" : [ "1" ],
    "position" : 900
  },
  "rows" : [
    {
      "type" : "row",
      "position" : 10,
      "deletion_info" : { "marked_deleted" : "2022-10-19T00:00:00.000001Z",
        "local_delete_time" : "2022-10-19T00:00:00.000001Z" },
      "cells" : [ ]
    }
  ]
}
{code}
* Day20: Nothing happens for more than 10 days, i.e., longer than
gc_grace_seconds. Let's say the data layout on SSTables on n1, n2, and n3 is
the same as on Day5
* Day20: A new node (n4) joins the ring and becomes responsible for key "1".
Let's say it streams the data from n3. The node _n3_ is supposed to stream
out SSTable1, SSTable2, and SSTable3, but per the streaming algorithm this
does not happen atomically. Consider a scenario in which _n4_ has received
SSTable1 and SSTable3, but not yet SSTable2, and _n4_ compacts SSTable1 and
SSTable3 together. Because the tombstone in SSTable3 is older than
gc_grace_seconds, _n4_ purges key "1" entirely, leaving no trace of it on
{_}n4{_}. Some time later, SSTable2 is streamed in, and it delivers key "1"
as live data again.
* Day20: _n4_ becomes normal
{code:java}
Query on n4:
$> CONSISTENCY LOCAL_ONE; SELECT * FROM KS1.T1 WHERE key = 1;
// 1 | 10 | 20 | 30 <-- A record is returned
Query on n1:
$> CONSISTENCY LOCAL_ONE; SELECT * FROM KS1.T1 WHERE key = 1;
// <empty> //no output{code}
Does this make sense?
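The failure mode above can be sketched with a toy model. This is a minimal
simulation (assumption: the {{compact}} helper and the data layout below are
illustrative, not Cassandra's actual compaction code) showing how compacting
SSTable1 and SSTable3 without SSTable2 purges the expired tombstone, after
which the late-arriving SSTable2 resurrects the row:
{code:python}
GC_GRACE_SECONDS = 864000  # 10 days, as in the table definition

# Each SSTable is modeled as {key: (timestamp, value)}; value None is a
# tombstone marker for that key.
def compact(sstables, now):
    """Merge SSTables, keep the newest cell per key, and drop tombstones
    whose gc_grace window has expired (as compaction is allowed to do)."""
    merged = {}
    for table in sstables:
        for key, (ts, value) in table.items():
            if key not in merged or ts > merged[key][0]:
                merged[key] = (ts, value)
    return {k: (ts, v) for k, (ts, v) in merged.items()
            if not (v is None and now - ts > GC_GRACE_SECONDS)}

DAY = 86400
sstable1 = {1: (1 * DAY, (10, 20, 30))}  # Day 1 insert
sstable2 = {1: (1 * DAY, (10, 20, 30))}  # same row in a second SSTable
sstable3 = {1: (4 * DAY, None)}          # Day 4 tombstone

now = 20 * DAY  # Day 20: the tombstone is older than gc_grace_seconds

# n4 has received only SSTable1 and SSTable3 and compacts them:
partial = compact([sstable1, sstable3], now)
assert 1 not in partial  # key "1" and its tombstone are both purged

# SSTable2 arrives afterwards; the next compaction resurrects the row:
final = compact([partial, sstable2], now)
print(final)  # the deleted row is live again
{code}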
*Possible Solution*
* One possible solution is to not purge tombstones while there are pending
token range movements in the ring
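As a hedged sketch of that idea (assumption: {{purge_allowed}} and
{{pending_movements}} are illustrative names; a real fix would hook into
Cassandra's compaction path and ring state, which this does not model):
{code:python}
def purge_allowed(tombstone_ts, now, gc_grace, pending_movements):
    """A tombstone may only be dropped when its gc_grace window has
    expired AND no token-range movement (bootstrap/decommission/move)
    is in flight anywhere in the ring."""
    return (now - tombstone_ts > gc_grace) and not pending_movements

DAY = 86400
# Day-4 tombstone, checked on Day 20 with gc_grace_seconds = 864000:
print(purge_allowed(4 * DAY, 20 * DAY, 864000, ["n4 bootstrapping"]))  # False
print(purge_allowed(4 * DAY, 20 * DAY, 864000, []))                    # True
{code}
With such a guard, the Day-20 compaction on _n4_ would keep the SSTable3
tombstone until its bootstrap completes, so the late-arriving SSTable2 could
not resurrect key "1".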
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]