Bartłomiej Romański created CASSANDRA-6909:
----------------------------------------------

             Summary: A way to expire columns without converting to tombstones
                 Key: CASSANDRA-6909
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-6909
             Project: Cassandra
          Issue Type: New Feature
          Components: Core
            Reporter: Bartłomiej Romański


Imagine the following scenario. 

- You need to store some data knowing that you will need it only for a limited 
time (say 7 days).
- After that you just don't care. You don't need it to be returned in queries, 
but if it is returned that's not a problem at all - you won't look at it 
anyway.
- Your records are small. Row keys and column names are even longer than the 
actual values (e.g. ints vs strings).
- You reuse rows. You add some new columns to most of the rows every day or 
two. This means that columns expire often, but rows usually do not.
- You generate a lot of data and want to make sure that expired records do not 
consume disk space for too long.

The current TTL feature does not handle this situation well. When compaction 
finally decides that it's worth compacting a given sstable, it won't simply 
get rid of expired columns. Instead it transforms them into tombstones. For 
small values that's hardly a saving at all.
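As a rough back-of-the-envelope illustration (the per-field sizes below are 
simplified assumptions for the sake of the argument, not Cassandra's actual 
on-disk serialization format), turning a small cell into a tombstone saves 
almost nothing:

```python
# Illustrative model only -- NOT Cassandra's real storage format.
# Assume a live expiring cell stores: column name, value, an 8-byte
# timestamp and ~8 bytes of TTL metadata; a tombstone stores: column
# name, an 8-byte deletion timestamp and a 4-byte local deletion time.

def live_cell_size(name_len, value_len):
    return name_len + value_len + 8 + 8  # name + value + timestamp + TTL info

def tombstone_size(name_len):
    return name_len + 8 + 4  # name + deletion timestamp + local deletion time

# A 4-byte int value under a 30-byte column name:
live = live_cell_size(30, 4)   # 50 bytes
tomb = tombstone_size(30)      # 42 bytes
print(f"live={live}B tombstone={tomb}B saving={(live - tomb) / live:.0%}")
```

So for int-sized values the tombstone keeps the dominant cost (the name) and 
reclaims only a small fraction of the space.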

Even if you set the grace period to 0, tombstones cannot be removed too early 
because some other sstable may still contain values that should be "covered" 
by the tombstone.

You can get rid of a tombstone in only two cases:

- it's a major compaction (which never happens with LCS and requires a lot of 
free space with STCS)
- bloom filters tell you that no other sstable contains this row key

The second option is not common if you usually have multiple columns per row 
that were not written all at once. There's a great chance your row is spread 
across multiple sstables, and new ones keep being generated. There's very 
little chance they'll all meet in one compaction at some point.
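The drop decision can be sketched like this (a toy model, not Cassandra's 
actual code; the class and function names are hypothetical):

```python
# Sketch: a tombstone can only be purged when no sstable outside the
# compaction set might still hold an older value for the same row.

class SSTable:
    def __init__(self, keys):
        # Keys this sstable covers -- including rows that hold only
        # tombstones, since a bloom filter cannot tell live data apart
        # from tombstones.
        self.keys = set(keys)

    def might_contain(self, key):
        # Stands in for a bloom filter check (real filters may also
        # return false positives, which only makes purging rarer).
        return key in self.keys

def can_drop_tombstone(row_key, compacting, all_sstables, gc_grace_expired):
    if not gc_grace_expired:
        return False
    others = [s for s in all_sstables if s not in compacting]
    return not any(s.might_contain(row_key) for s in others)

# A row spread across two sstables: compacting only one of them cannot
# purge the tombstone, even with the grace period already expired.
a, b = SSTable({"row1"}), SSTable({"row1", "row2"})
print(can_drop_tombstone("row1", [a], [a, b], gc_grace_expired=True))
print(can_drop_tombstone("row1", [a, b], [a, b], gc_grace_expired=True))
```

The second call succeeds only because both sstables happen to be in the same 
compaction - exactly the rare "they all meet at once" case described above.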

What's funny, a bloom filter also returns true if there's a tombstone for the 
given row in the given sstable. So you won't remove tombstones during 
compaction, because there's some other tombstone for that row in another 
sstable :/

After a while, you end up with a lot of tombstones (the majority of your 
data) and can do nothing about it.

Now imagine that Cassandra knows that we just don't care about data older 
than 7 days.

Firstly, it could simply drop such columns during compactions (without 
converting them to tombstones or anything like that).

Secondly, if it detects an sstable older than 7 days, it could safely remove 
it entirely (it cannot contain any live data).

These two rules *guarantee* that your data will be removed within 14 days 
(2x TTL). If a compaction runs after 7 days, the expired data is removed. If 
not, the whole sstable is removed after another 7 days.
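A minimal sketch of the two rules and the 2x TTL bound (a toy model with a 
hypothetical table-wide TTL, not a proposed patch):

```python
TTL = 7  # days; a hypothetical table-wide expiration

def compact(cells, now):
    # Rule 1: during compaction, drop expired cells outright -- no tombstones.
    return [c for c in cells if now - c["written"] < TTL]

def sstable_is_dead(newest_write, now):
    # Rule 2: if even the newest write in an sstable is older than TTL,
    # the whole file holds no live data and can be deleted as-is.
    return now - newest_write >= TTL

# Worst case for a cell written on day 0: its sstable's newest write lands
# just before day 7, so the whole file becomes droppable just before day 14
# (2x TTL) even if no compaction ever touches it.
print(compact([{"written": 0}], now=7))         # expired cell dropped
print(sstable_is_dead(newest_write=7, now=14))  # whole sstable removable
```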

That's what I expected from CASSANDRA-3974, but it turned out to be just a 
trivial frontend feature.

I suggest rethinking this mechanism. I don't believe it's a common scenario 
that someone who sets a TTL for a whole CF needs all these strong guarantees 
that data will not reappear in the future in case of consistency issues 
(which is why we need this whole mess with tombstones).

I believe the common case with a per-CF TTL is that you just want an 
efficient way to recover your disk space (and to improve read performance by 
having fewer sstables and less data in general).

To work around this we currently periodically stop Cassandra, simply remove 
sstables that are too old, and start it back up. This works OK, but does not 
fully solve the problem (if a tombstone is frequently rewritten by 
compactions, we will never remove it).
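The workaround looks roughly like this (a sketch: the file-name suffix and 
age threshold are examples, file mtime is a crude proxy for sstable age, and 
Cassandra must be stopped before anything is deleted):

```python
import os
import time

MAX_AGE = 7 * 24 * 3600  # 7 days, matching the TTL

def old_sstables(data_dir, max_age=MAX_AGE, now=None):
    """Return sstable data files whose last write is older than max_age."""
    now = time.time() if now is None else now
    victims = []
    for name in sorted(os.listdir(data_dir)):
        if name.endswith("-Data.db"):  # one such entry per sstable
            path = os.path.join(data_dir, name)
            if now - os.path.getmtime(path) > max_age:
                victims.append(path)
    return victims

# With Cassandra stopped, each returned path (and the sstable's companion
# component files: index, filter, etc.) can then be removed.
```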




--
This message was sent by Atlassian JIRA
(v6.2#6252)
