[
https://issues.apache.org/jira/browse/CASSANDRA-14279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16381562#comment-16381562
]
Jeff Jirsa commented on CASSANDRA-14279:
----------------------------------------
Relevant / related mailing list post:
https://lists.apache.org/thread.html/34e980c8e1ad6c06e28f99139f9bdec9878eb004da056a17774d0ad3@%3Cdev.cassandra.apache.org%3E
> Row Tombstones in separate sstables / separate compaction path
> --------------------------------------------------------------
>
> Key: CASSANDRA-14279
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14279
> Project: Cassandra
> Issue Type: Improvement
> Components: Compaction, Local Write-Read Paths, Repair
> Reporter: Constance Eustace
> Priority: Major
>
> In my experience, if data is not well organized into time-windowed sstables,
> Cassandra has enormous difficulty actually deleting data when that data has a
> "medium term" lifetime and is commingled with data that isn't marked for
> deletion, as happens with compactions or intermingled write patterns. For
> example, you might have an active working set and be archiving "unused"
> data to other tables or clusters. Or you may be purging data. Or you may be
> migrating/sharding/restructuring data. Whatever the case, you want that disk
> space back, and you might not be able to truncate.
> In STCS and LCS, row tombstones are intermingled with column data and column
> tombstones. But a row tombstone represents a significant event in the data
> lifecycle: it makes large amounts of data "droppable" during compaction and
> gives reads a shortcut past data in other sstables. It could also enable
> writes to be discarded, in rare data patterns, if the row tombstone's
> timestamp is ahead of the write's.
> I am wondering whether, if row tombstones were isolated in their own sstables
> and separately compacted and merged, compaction could work more efficiently:
> - Reads can prioritize bloom filter lookups that indicate a row tombstone,
>   getting the timestamp of the deletion first, and then use it in the data
>   sstables to filter data, or to short-circuit the read if the row data has
>   an overall "most recent data timestamp".
> - Compaction could be forced to reference all the row tombstone sstables, so
>   that every time two or more "data" sstables are compacted, they must
>   consult the row tombstones to purge data.
> In LCS, this would be particularly useful in getting data out of the upper
> levels without having to wait for data to trickle up the tree. The row
> tombstones, being read-only inputs into the data sstable compactions, can be
> referenced in each of the LCS levels' parallel compactors.
> Based on discussions in the dev list, this would appear to require some sort
> of customization to the memtable->sstable flushing process, and perhaps a
> different set of bloom filters.
> Since the row tombstone sstables contain only <rowkey>,<tombstone timestamp>
> pairs, they should be comparatively small and take less time to compact. They
> could be compacted aggressively, on a different schedule than "data" sstables.
> In addition, it may be easier to repair/synchronize row tombstones across the
> cluster if they have already been separated into their own sstables.
> Column/range tombstones may also benefit from a similar separation, but my
> guess is that those are so much more numerous, larger, and finer-grained that
> they might as well coexist with the data.
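
The proposal quoted above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not Cassandra code: the dict-based "tombstone sstable", the tuple-based "data sstable", and the function names merge_tombstones, compact_data, and read_row are all hypothetical. It also assumes a deletion shadows a write with an equal timestamp, and it ignores gc_grace_seconds, repair, and overlap checks that real tombstone purging has to respect.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical model: a row-tombstone sstable maps row key -> deletion timestamp.
TombstoneTable = Dict[str, int]
# Hypothetical model: a data sstable is a list of (row_key, column, value, write_ts).
Cell = Tuple[str, str, str, int]


def merge_tombstones(tables: Iterable[TombstoneTable]) -> TombstoneTable:
    """Compact several tombstone tables, keeping the newest deletion per row key."""
    merged: TombstoneTable = {}
    for table in tables:
        for key, deleted_at in table.items():
            if key not in merged or deleted_at > merged[key]:
                merged[key] = deleted_at
    return merged


def is_shadowed(cell: Cell, tombstones: TombstoneTable) -> bool:
    """True if a row tombstone at or after the cell's write timestamp exists."""
    deleted_at = tombstones.get(cell[0])
    # Assumption: on a timestamp tie the deletion wins.
    return deleted_at is not None and cell[3] <= deleted_at


def compact_data(sstables: Iterable[List[Cell]],
                 tombstones: TombstoneTable) -> List[Cell]:
    """Merge data sstables, dropping shadowed cells and keeping the newest
    version of each (row_key, column). The tombstones are a read-only input."""
    newest: Dict[Tuple[str, str], Cell] = {}
    for sstable in sstables:
        for cell in sstable:
            if is_shadowed(cell, tombstones):
                continue
            slot = (cell[0], cell[1])
            if slot not in newest or cell[3] > newest[slot][3]:
                newest[slot] = cell
    return [newest[slot] for slot in sorted(newest)]


def read_row(key: str, sstables: Iterable[List[Cell]],
             tombstones: TombstoneTable) -> Dict[str, str]:
    """Look up the row tombstone first, then filter cells from the data sstables."""
    deleted_at = tombstones.get(key)
    live: Dict[str, Tuple[str, int]] = {}
    for sstable in sstables:
        for row_key, column, value, write_ts in sstable:
            if row_key != key:
                continue
            if deleted_at is not None and write_ts <= deleted_at:
                continue  # shadowed by the row tombstone
            if column not in live or write_ts > live[column][1]:
                live[column] = (value, write_ts)
    return {column: value for column, (value, _) in live.items()}
```

In this model the tombstone tables stay small (one timestamp per deleted row key), so merging them is cheap, and every data compaction can consult the merged table regardless of which data sstables it happens to be compacting.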