[ https://issues.apache.org/jira/browse/CASSANDRA-15369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17213681#comment-17213681 ]

Marcus Eriksson commented on CASSANDRA-15369:
---------------------------------------------

looks good in general, two concerns:
* performance of {{SinglePartitionReadCommand#reduceFilter}} is much worse now (a silly local laptop benchmark shows queries being 15% slower) - the reason seems to be that we use {{try (UnfilteredRowIterator iterator = result.unfilteredIterator(columnFilter(), filter.getSlices(metadata()), false))}} - I think we can just replace that with {{try (UnfilteredRowIterator iterator = result.unfilteredIterator(columnFilter(), clusterings, false))}}? (see the sketch below this list)
* {{AbstractBTreePartition#getRow}} - this looks like it is missing the fix from CASSANDRA-15363 - the {{row == null}} case should probably be
{code}
                    // this means our partition-level deletion supersedes all other deletions and we don't have to keep the row deletions
                    if (activeDeletion == partitionDeletion)
                        return null;
                    // no need to check activeDeletion.isLive here - if anything supersedes the partitionDeletion
                    // it must be non-live
                    return BTreeRow.emptyDeletedRow(clustering, Row.Deletion.regular(activeDeletion));
{code}
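
For the first concern, a minimal sketch of the suggested swap in {{SinglePartitionReadCommand#reduceFilter}} - the surrounding loop body and the exact origin of {{clusterings}} (shown here as {{filter.requestedRows()}}) are assumptions about the patch under review, not a definitive implementation:
{code}
// Sketch only - assumes `result` (the cached Partition), `filter` (a
// ClusteringIndexNamesFilter) and `columnFilter()` are in scope as in the patch.
NavigableSet<Clustering> clusterings = filter.requestedRows(); // assumed source of `clusterings`

// Instead of materialising Slices from the names filter:
//   try (UnfilteredRowIterator iterator =
//            result.unfilteredIterator(columnFilter(), filter.getSlices(metadata()), false))
// iterate only the requested clusterings directly:
try (UnfilteredRowIterator iterator = result.unfilteredIterator(columnFilter(), clusterings, false))
{
    while (iterator.hasNext())
    {
        Unfiltered unfiltered = iterator.next();
        // ... existing reduceFilter logic: drop clusterings that no longer need fetching ...
    }
}
{code}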

> Fake row deletions and range tombstones, causing digest mismatch and sstable growth
> -------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-15369
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-15369
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Consistency/Coordination, Local/Memtable, Local/SSTable
>            Reporter: Benedict Elliott Smith
>            Assignee: Zhao Yang
>            Priority: Normal
>             Fix For: 3.0.x, 3.11.x, 4.0-beta
>
>
> As assessed in CASSANDRA-15363, we generate fake row deletions and fake tombstone markers under various circumstances:
>  * If we perform a clustering key query (or select a compact column):
>  ** Serving from a {{Memtable}}, we will generate fake row deletions
>  ** Serving from an sstable, we will generate fake row tombstone markers
>  * If we perform a slice query, we will generate only fake row tombstone markers for any range tombstone that begins or ends outside of the limit of the requested slice
>  * If we perform a multi-slice or IN query, this will occur for each slice/clustering
> Unfortunately, these different behaviours can lead to very different data stored in sstables until a full repair is run.  When we read-repair, we only send these fake deletions or range tombstones.  A fake row deletion, a clustering RT and a slice RT each produce a different digest.  So for each single point lookup we can produce a digest mismatch twice, and until a full repair is run we can encounter an unlimited number of digest mismatches across different overlapping queries.
> Relatedly, this seems a more problematic variant of our atomicity failures caused by our monotonic reads, since RTs can have an atomic effect across (up to) the entire partition, whereas the propagation may happen on an arbitrarily small portion.  If the RT exists on only one node, this could plausibly lead to a fairly problematic scenario if that node fails before the range can be repaired.
> At the very least, this behaviour can lead to an almost unlimited amount of extraneous data being stored until the range is repaired and compaction happens to overwrite the sub-range RTs and row deletions.
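
To make the digest point in the description concrete: the read digest is computed over the serialized response, so two replicas that answer with semantically equivalent but differently-shaped deletions will disagree. Below is a Cassandra-independent illustration using plain JDK MD5 (the hash 3.0 read digests use); the byte strings are purely hypothetical stand-ins for the different serialized shapes:
{code}
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Arrays;

public class DigestShapeSketch
{
    public static void main(String[] args) throws Exception
    {
        // Hypothetical encodings of the same logical fact, "ck=10 is deleted at ts=42",
        // once as a row deletion and once as a clustering range tombstone:
        byte[] asRowDeletion  = "ROW_DELETION ck=10 ts=42".getBytes(StandardCharsets.UTF_8);
        byte[] asClusteringRT = "RT [ck=10, ck=10] ts=42".getBytes(StandardCharsets.UTF_8);

        MessageDigest d1 = MessageDigest.getInstance("MD5");
        MessageDigest d2 = MessageDigest.getInstance("MD5");

        // Same logical outcome, different serialized shape -> different digest -> mismatch.
        System.out.println(Arrays.equals(d1.digest(asRowDeletion), d2.digest(asClusteringRT))); // false
    }
}
{code}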



