[
https://issues.apache.org/jira/browse/CASSANDRA-15369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17060161#comment-17060161
]
ZhaoYang edited comment on CASSANDRA-15369 at 3/16/20, 12:18 PM:
-----------------------------------------------------------------
Changes in [trunk patch|https://github.com/apache/cassandra/pull/473/files]:
1. Return a row range tombstone (i.e. one covering a single row) instead of a row
deletion when a single-partition named query (i.e. with full clustering keys) hits
a range tombstone in the memtable.
* The reason I chose a row RT instead of a row deletion is to make named queries
consistent with slice queries, where fake RTs are created for the slice
clustering bounds.
* For example, in a table with pk, ck1, ck2, ck3, suppose there is an existing
deletion of pk=1 & ck1=1 with timestamp 10 (see the sketch after this list).
After the patch:
** a query with pk=1 returns the RT as-is, since the query covers the original RT.
** a query with pk=1 & ck1=1 returns the RT as-is, since the query covers the
original RT.
** a query with pk=1 & ck1=1 & ck2=1 returns a fake RT with ClusteringBound
ck1=1 & ck2=1.
** a query with pk=1 & ck1=1 & ck2=1 & ck3=1 returns a fake RT with
ClusteringBound ck1=1 & ck2=1 & ck3=1.
2. When a partition deletion's timestamp ties with a range tombstone or row
deletion, the RT or row deletion is now removed when creating responses.
* Prior to the patch, a partition deletion would only remove a row deletion with
the same timestamp during compaction, via {{Row.Merger.merge()}}. A sketch of the
tie-break follows.
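Again purely as illustration (simplified stand-in types, not the real
{{DeletionTime}}/{{Row}} API; requires Java 16+ for the record), the tie-break
now applied when building the response:
{code:java}
// Tie-break rule sketch: a partition-level deletion whose timestamp is greater
// than OR EQUAL TO that of a range tombstone / row deletion now shadows it in
// the read response (previously a tie was only resolved during compaction).
public class TieBreakSketch
{
    record Deletion(long markedForDeleteAt) {}

    // Should this RT / row deletion still be emitted in the response?
    static boolean emitDeletion(Deletion partitionDeletion, Deletion deletion)
    {
        return deletion.markedForDeleteAt() > partitionDeletion.markedForDeleteAt();
    }

    public static void main(String[] args)
    {
        Deletion partitionDeletion = new Deletion(10);
        System.out.println(emitDeletion(partitionDeletion, new Deletion(10))); // false: tie, dropped
        System.out.println(emitDeletion(partitionDeletion, new Deletion(11))); // true: newer, kept
    }
}
{code}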
During testing, I found another issue, related to SPRC skipping older sstables
and causing digest mismatches: CASSANDRA-15640
[~benedict] WDYT?
> Fake row deletions and range tombstones, causing digest mismatch and sstable
> growth
> -----------------------------------------------------------------------------------
>
> Key: CASSANDRA-15369
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15369
> Project: Cassandra
> Issue Type: Bug
> Components: Consistency/Coordination, Local/Memtable, Local/SSTable
> Reporter: Benedict Elliott Smith
> Assignee: ZhaoYang
> Priority: Normal
> Fix For: 4.0, 3.0.x, 3.11.x
>
>
> As assessed in CASSANDRA-15363, we generate fake row deletions and fake
> tombstone markers under various circumstances:
> * If we perform a clustering key query (or select a compact column):
> ** Serving from a {{Memtable}}, we will generate fake row deletions
> ** Serving from an sstable, we will generate fake row tombstone markers
> * If we perform a slice query, we will generate only fake row tombstone
> markers for any range tombstone that begins or ends outside of the limit of
> the requested slice
> * If we perform a multi-slice or IN query, this will occur for each
> slice/clustering
> Unfortunately, these different behaviours can lead to very different data
> stored in sstables until a full repair is run. When we read-repair, we only
> send these fake deletions or range tombstones. A fake row deletion, a
> clustering RT and a slice RT each produce a different digest. So for each
> single point lookup we can produce a digest mismatch twice, and until a full
> repair is run we can encounter an unlimited number of digest mismatches
> across different overlapping queries.
> Relatedly, this seems a more problematic variant of our atomicity failures
> caused by our monotonic reads, since RTs can have an atomic effect across (up
> to) the entire partition, whereas the propagation may happen on an
> arbitrarily small portion. If the RT exists on only one node, this could
> plausibly lead to a fairly problematic scenario if that node fails before the
> range can be repaired.
> At the very least, this behaviour can lead to an almost unlimited amount of
> extraneous data being stored until the range is repaired and compaction
> happens to overwrite the sub-range RTs and row deletions.