Sylvain Lebresne created CASSANDRA-5183:
-------------------------------------------

             Summary: Improve cases where we purge tombstone on (minor) 
compaction
                 Key: CASSANDRA-5183
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-5183
             Project: Cassandra
          Issue Type: Improvement
            Reporter: Sylvain Lebresne
            Priority: Minor


Currently, to be able to purge a tombstone, we check that the row it is part of 
is not present in a non-compacted sstable, as we should not remove a tombstone 
that may delete other columns in the non-compacted sstables.

The (known) problem is that if you regularly update a row on which you've made 
deletes, tombstones may theoretically be kept forever unless you run a major 
compaction (which is bad and not even a possibility with leveled compaction).

In practice, with wide rows and more precisely time-series workloads, it is 
not unlikely that tombstones might be kept, if not forever, at least much 
longer than gcgrace.

One way to improve on that would be to start storing the minTimestamp of 
sstables (like we keep the maxTimestamp). During compaction, on top of checking 
bloom filters, we would also check if the max timestamp of what we're about to 
purge is smaller than the min timestamp of the non-compacted sstables. If it is, 
then whatever tombstone we are looking at cannot shadow something in the 
non-compacted sstables and we can purge it (that is, even if the row in question 
may have columns in those non-compacted sstables).
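
As a rough illustration of the check described above, here is a minimal Java 
sketch. It is not Cassandra's actual code: the SSTableStats class and the 
canPurgeByTimestamp method are hypothetical names invented for this example, 
and the only assumption is that each sstable exposes a min and max timestamp.

```java
import java.util.List;

class TombstonePurgeSketch {
    /** Hypothetical stand-in for per-sstable metadata (not a real Cassandra class). */
    static final class SSTableStats {
        final long minTimestamp;
        final long maxTimestamp;

        SSTableStats(long minTimestamp, long maxTimestamp) {
            this.minTimestamp = minTimestamp;
            this.maxTimestamp = maxTimestamp;
        }
    }

    /**
     * Returns true if everything we are about to purge is strictly older than
     * everything in the non-compacted sstables, so that no purged tombstone
     * could shadow data living in those sstables.
     */
    static boolean canPurgeByTimestamp(long maxPurgeableTimestamp,
                                       List<SSTableStats> nonCompacted) {
        for (SSTableStats sstable : nonCompacted) {
            if (sstable.minTimestamp <= maxPurgeableTimestamp) {
                return false; // this sstable may hold data the tombstone deletes
            }
        }
        return true;
    }
}
```

The comparison is strict, following the "smaller than" wording above: a 
tombstone whose timestamp equals the min timestamp of a non-compacted sstable 
is conservatively kept.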

Note that while this isn't perfect in theory:
# this is cheap to check. We may even compute the min timestamp of the 
non-compacted sstables once at the beginning of the compaction and check that 
before looking at the BF, which may save a few intervalTree searches (if we do 
end up doing the intervalTree search, however, we might still want to recompute 
the min timestamp of the returned sstables, as this may be bigger than the min 
timestamp of all the non-compacted sstables).
# both size-tiered and leveled compaction naturally tend to compact sstables 
having data of roughly the same age, so this should work reasonably well.
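
The two-stage check from the first point could look roughly like the Java 
sketch below. All names are hypothetical (TwoStagePurgeCheck, SSTableStats, 
the overlapping function); in particular, the overlapping function merely 
stands in for the intervalTree search plus bloom filter check and does not 
reflect Cassandra's real API.

```java
import java.util.List;
import java.util.function.Function;

class TwoStagePurgeCheck {
    /** Hypothetical stand-in for per-sstable metadata (not a real Cassandra class). */
    static final class SSTableStats {
        final long minTimestamp;

        SSTableStats(long minTimestamp) {
            this.minTimestamp = minTimestamp;
        }
    }

    static long minTimestampOf(List<SSTableStats> sstables) {
        long min = Long.MAX_VALUE; // an empty list imposes no constraint
        for (SSTableStats s : sstables) {
            min = Math.min(min, s.minTimestamp);
        }
        return min;
    }

    private final long globalMinTimestamp;
    // Assumption: maps a row key to the non-compacted sstables that may contain it.
    private final Function<String, List<SSTableStats>> overlapping;
    int lookups = 0; // counts how often the expensive lookup was needed

    TwoStagePurgeCheck(List<SSTableStats> nonCompacted,
                       Function<String, List<SSTableStats>> overlapping) {
        // Computed once, at the beginning of the compaction.
        this.globalMinTimestamp = minTimestampOf(nonCompacted);
        this.overlapping = overlapping;
    }

    boolean canPurge(String rowKey, long maxPurgeableTimestamp) {
        // Stage 1: cheap check against the global min, no lookup at all.
        if (maxPurgeableTimestamp < globalMinTimestamp) {
            return true;
        }
        // Stage 2: do the lookup, then recompute the min over only the
        // sstables returned, which may be larger than the global min.
        lookups++;
        return maxPurgeableTimestamp < minTimestampOf(overlapping.apply(rowKey));
    }
}
```

With non-compacted sstables whose min timestamps are 100 and 500, and only the 
second one possibly containing the row, a tombstone at timestamp 50 is purged 
without any lookup, while one at timestamp 200 fails the global check but is 
still purged after the lookup narrows the candidates down.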

