Sylvain Lebresne created CASSANDRA-5183:
-------------------------------------------
Summary: Improve cases where we purge tombstone on (minor)
compaction
Key: CASSANDRA-5183
URL: https://issues.apache.org/jira/browse/CASSANDRA-5183
Project: Cassandra
Issue Type: Improvement
Reporter: Sylvain Lebresne
Priority: Minor
Currently, to be able to purge a tombstone, we check that the row it is part of
is not present in a non-compacted sstable, as we should not remove a tombstone
that may delete other columns in the non-compacted sstables.
The (known) problem is that if you regularly update a row on which you've made
deletes, tombstones may theoretically be kept forever unless you run a major
compaction (which is bad and not even a possibility with leveled compaction).
In practice, with wide rows and more precisely time-series type of load, it is
not unlikely that tombstones might be kept, if not forever, at least much
longer than gcgrace.
One way to improve on that would be to start storing the minTimestamp of
sstables (like we keep the maxTimestamp). During compaction, on top of checking
bloom filters, we would also check if the max timestamp of what we're about to
purge is smaller than the min timestamp of the non-compacted sstables. If it is,
then whatever tombstone we are looking at cannot shadow something in the
non-compacted sstables and we can purge it (that is, even if the row in question
may have columns in those non-compacted sstables).
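
The check described above can be sketched roughly as follows. This is only an
illustration and not Cassandra's actual code: SSTableInfo, mayContainRow
(standing in for the bloom filter lookup) and canPurge are hypothetical names,
and the gcgrace condition that must also hold before purging is left out.

```java
import java.util.List;

// Hypothetical sketch; none of these names come from the Cassandra code base.
class TombstonePurgeSketch {
    /** Stand-in for the metadata of one non-compacted sstable. */
    static final class SSTableInfo {
        final long minTimestamp;     // smallest timestamp stored in the sstable
        final boolean mayContainRow; // assumed result of the bloom filter check

        SSTableInfo(long minTimestamp, boolean mayContainRow) {
            this.minTimestamp = minTimestamp;
            this.mayContainRow = mayContainRow;
        }
    }

    /**
     * Returns true if tombstones with the given max timestamp cannot shadow
     * anything in the non-compacted sstables.
     */
    static boolean canPurge(long maxTombstoneTimestamp, List<SSTableInfo> nonCompacted) {
        for (SSTableInfo sstable : nonCompacted) {
            // Existing rule: an sstable that does not contain the row is harmless.
            if (!sstable.mayContainRow) {
                continue;
            }
            // Proposed rule: the row may be present, but if everything in the
            // sstable is newer than the tombstones, nothing there is shadowed.
            if (sstable.minTimestamp <= maxTombstoneTimestamp) {
                return false;
            }
        }
        return true;
    }
}
```

With one sstable that may contain the row and has minTimestamp 100,
canPurge(50, ...) holds while canPurge(100, ...) does not, because the
comparison is strict.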
Note that while this isn't perfect in theory:
# this is cheap to check. We may even compute the min timestamp of the
non-compacted sstables once at the beginning of the compaction and check that
before looking at the BF, which may save a few intervalTree searches (if we do
end up doing the intervalTree search, however, we might still want to recompute
the min timestamp of the returned sstables, as this may be bigger than the min
timestamp of all the non-compacted sstables).
# both size tiered and leveled naturally tend to compact sstables having data of
roughly the same age, so this should work reasonably well.
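
The fast path from the first point could look something like the sketch below.
It is an assumption-laden illustration: FastPathPurgeSketch, SSTableInfo and
perRowSearches are made-up names, and a plain set of row keys stands in for
the intervalTree search plus bloom filter check.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch; none of these names come from the Cassandra code base.
class FastPathPurgeSketch {
    static final class SSTableInfo {
        final long minTimestamp;
        final Set<String> rowKeys; // stand-in for intervalTree + bloom filter

        SSTableInfo(long minTimestamp, String... rowKeys) {
            this.minTimestamp = minTimestamp;
            this.rowKeys = new HashSet<>(Arrays.asList(rowKeys));
        }
    }

    private final List<SSTableInfo> nonCompacted;
    private final long globalMinTimestamp; // computed once per compaction
    int perRowSearches = 0;                // counts how often the slow path runs

    FastPathPurgeSketch(List<SSTableInfo> nonCompacted) {
        this.nonCompacted = nonCompacted;
        long min = Long.MAX_VALUE;
        for (SSTableInfo sstable : nonCompacted) {
            min = Math.min(min, sstable.minTimestamp);
        }
        this.globalMinTimestamp = min;
    }

    boolean canPurge(String rowKey, long maxTombstoneTimestamp) {
        // Fast path: older than everything outside the compaction, no search needed.
        if (maxTombstoneTimestamp < globalMinTimestamp) {
            return true;
        }
        // Slow path: recompute the min timestamp over only the sstables that
        // may hold this row; it can be bigger than the global minimum.
        perRowSearches++;
        long minForRow = Long.MAX_VALUE;
        for (SSTableInfo sstable : nonCompacted) {
            if (sstable.rowKeys.contains(rowKey)) {
                minForRow = Math.min(minForRow, sstable.minTimestamp);
            }
        }
        return maxTombstoneTimestamp < minForRow;
    }
}
```

The per-row recomputation matters when the sstable holding the oldest data
does not contain the row being compacted: the global minimum would then
forbid a purge that the narrower check allows.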