[ 
https://issues.apache.org/jira/browse/CASSANDRA-14861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16671870#comment-16671870
 ] 

Blake Eggleston edited comment on CASSANDRA-14861 at 11/1/18 5:51 PM:
----------------------------------------------------------------------

|[3.0|https://github.com/bdeggleston/cassandra/tree/14861-3.0]|[circle|https://circleci.com/gh/bdeggleston/workflows/cassandra/tree/cci%2F14861-3.0]|
|[3.11|https://github.com/bdeggleston/cassandra/tree/14861-3.11]|[circle|https://circleci.com/gh/bdeggleston/workflows/cassandra/tree/cci%2F14861-3.11]|
|[trunk|https://github.com/bdeggleston/cassandra/tree/14861-trunk]|[circle|https://circleci.com/gh/bdeggleston/workflows/cassandra/tree/14861-trunk]|

This adds a minor sstable version to 3.x and changes 2 behaviors. First, when 
reading metadata for pre-md sstables, -only the first clustering value is 
loaded into the min/max values and the rest are discarded- min/max values are 
discarded. Second, when writing new sstables, the length of the min/max values 
written is limited by the length of the shortest RT clustering.

edit: min/max values from legacy sstables need to be discarded; otherwise 
open-ended RTs (e.g. DELETE WHERE c < 100) would still have this problem.
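
The two behavior changes described above can be sketched roughly as follows. This is a simplified Python model, not the Cassandra (Java) implementation: the class `MinMaxTracker`, the function `usable_min_max`, and the plain string comparison of version names are all assumptions made for illustration.

```python
from typing import List, Optional, Sequence, Tuple


class MinMaxTracker:
    """Hypothetical write-side tracker for per-column min/max clustering values."""

    def __init__(self, num_clustering_columns: int) -> None:
        self.mins: List[Optional[int]] = [None] * num_clustering_columns
        self.maxs: List[Optional[int]] = [None] * num_clustering_columns
        # Length of the shortest RT marker clustering seen so far.
        self.shortest_rt = num_clustering_columns

    def update(self, clustering: Sequence[int], is_rt_marker: bool = False) -> None:
        # Only the columns present in this clustering are updated.
        for i, value in enumerate(clustering):
            self.mins[i] = value if self.mins[i] is None else min(self.mins[i], value)
            self.maxs[i] = value if self.maxs[i] is None else max(self.maxs[i], value)
        if is_rt_marker:
            self.shortest_rt = min(self.shortest_rt, len(clustering))

    def finish(self) -> Tuple[tuple, tuple]:
        # Written min/max are limited to the length of the shortest RT clustering.
        n = self.shortest_rt
        return tuple(self.mins[:n]), tuple(self.maxs[:n])


def usable_min_max(version: str, stored: Tuple[tuple, tuple]) -> Tuple[tuple, tuple]:
    """Discard min/max read from sstables older than the new 'md' version.

    Assumes version names order lexicographically ('mc' < 'md')."""
    return stored if version >= "md" else ((), ())


tracker = MinMaxTracker(3)
tracker.update((5, 6, 7))                 # a regular row
tracker.update((4,), is_rt_marker=True)   # RT open marker [4:]
written = tracker.finish()                # ((4,), (5,))
legacy = usable_min_max("mc", ((4, 6, 7), (5, 6, 7)))  # ((), ())
```

In this model an empty min/max means the sstable is never skipped on the basis of clustering values, which is the safe outcome for legacy sstables whose metadata may understate the range covered by a tombstone.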


was (Author: bdeggleston):
|[3.0|https://github.com/bdeggleston/cassandra/tree/14861-3.0]|[circle|https://circleci.com/gh/bdeggleston/workflows/cassandra/tree/cci%2F14861-3.0]|
|[3.11|https://github.com/bdeggleston/cassandra/tree/14861-3.11]|[circle|https://circleci.com/gh/bdeggleston/workflows/cassandra/tree/cci%2F14861-3.11]|
|[trunk|https://github.com/bdeggleston/cassandra/tree/14861-trunk]|[circle|https://circleci.com/gh/bdeggleston/workflows/cassandra/tree/14861-trunk]|

This adds a minor sstable version to 3.x and changes 2 behaviors. First, when 
reading metadata for pre-md sstables, only the first clustering value is loaded 
into the min/max values and the rest are discarded. When writing new sstables, 
the size of the min/max values written are limited by the length of the 
shortest RT clustering.

> sstable min/max metadata can cause data loss
> --------------------------------------------
>
>                 Key: CASSANDRA-14861
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-14861
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Blake Eggleston
>            Assignee: Blake Eggleston
>            Priority: Major
>             Fix For: 3.0.18, 3.11.4, 4.0
>
>
> There’s a bug in the way we filter sstables in the read path that can cause 
> sstables containing relevant range tombstones to be excluded from reads. This 
> can cause data resurrection for an individual read, and if compaction timing 
> is right, permanent resurrection via read repair. 
> We track the min and max clustering values when writing an sstable so we can 
> avoid reading from sstables that don’t contain the clustering values we’re 
> looking for in a given read. The min and max for each clustering column are 
> updated for each row / RT marker we write. In the case of range tombstone 
> markers though, we only update the min and max for the clustering values they 
> contain, which is almost never the full set of clustering values. This leaves 
> a min/max that is above/below (respectively) the real range covered by the 
> range tombstone contained in the sstable.
> For instance, assume we’re writing an sstable for a table with 3 clustering 
> columns. The current min clustering is 5:6:7. We write an RT marker for a 
> range tombstone that deletes any row with the value 4 in the first clustering 
> column, so the open marker is [4:]. This would make the new min clustering 
> 4:6:7 when it should really be 4:. If we do a read for clustering values of 
> 4:5 and lower, we’ll exclude this sstable and its range tombstone, 
> resurrecting any data there that this tombstone would have deleted.
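
The example in the description can be reproduced with a small model. This is a sketch under stated assumptions, not Cassandra's read path: clusterings are modeled as Python tuples of integers, the sstable filter is reduced to a lexicographic comparison of the sstable min against the read's upper bound, and the function names `update_min_per_column` and `sstable_may_match` are hypothetical.

```python
from typing import Sequence, Tuple


def update_min_per_column(current_min: Sequence[int],
                          clustering: Sequence[int]) -> Tuple[int, ...]:
    """Per-column min update as described: only columns present in
    `clustering` are touched, the remaining columns keep their old values."""
    updated = list(current_min)
    for i, value in enumerate(clustering):
        updated[i] = min(updated[i], value)
    return tuple(updated)


def sstable_may_match(sstable_min: Sequence[int],
                      read_upper_bound: Sequence[int]) -> bool:
    """Simplified filter (assumption): keep the sstable only if its min
    clustering sorts at or below the read's upper bound. Tuples compare
    lexicographically, and a shorter prefix sorts before its extensions."""
    return tuple(sstable_min) <= tuple(read_upper_bound)


current_min = (5, 6, 7)
rt_open_marker = (4,)  # the open marker [4:]

buggy_min = update_min_per_column(current_min, rt_open_marker)  # (4, 6, 7)
correct_min = (4,)                                              # what it should be

read_upper_bound = (4, 5)  # read for clusterings of 4:5 and lower

skipped_with_buggy_min = not sstable_may_match(buggy_min, read_upper_bound)
kept_with_correct_min = sstable_may_match(correct_min, read_upper_bound)
```

With the buggy min of 4:6:7 the sstable is skipped even though its range tombstone covers 4:5, while the min 4: keeps it in the read.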



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
