[ https://issues.apache.org/jira/browse/CASSANDRA-13418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15984514#comment-15984514 ]

Alexander Dejanovski commented on CASSANDRA-13418:
--------------------------------------------------

[~iksaif], I have a similar patch waiting to be tested on my laptop actually :)

The naming of the option in my version was deliberately scary so that people 
would (hopefully) think twice before using it: 
unsafe_expired_sstable_deletion

I fully agree that this option (whatever the name) should be available for TWCS 
(and why not DTCS), because the typical use case relies on TTLs as the deletion 
mechanism rather than explicit DELETE statements, which should only be needed 
in rare cases. If a cluster is a bit unhealthy for whatever reason, it is 
painful to watch read repair force tens of GB of data to stay on disk because 
of timestamp overlaps.
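
The blocking behavior can be modeled like this (a minimal sketch with 
simplified SSTable metadata; none of this is Cassandra's actual code):

```python
from dataclasses import dataclass

@dataclass
class SSTable:
    min_ts: int             # oldest write timestamp in the table
    max_ts: int             # newest write timestamp
    max_deletion_time: int  # when the last cell's TTL expires

def fully_expired(candidate, overlaps, now, ignore_overlaps=False):
    """An SSTable can be dropped whole when every cell has expired AND
    no overlapping SSTable holds data older than the candidate's newest
    write (older data could be shadowed by tombstones we would drop)."""
    if candidate.max_deletion_time >= now:
        return False
    if ignore_overlaps:
        return True
    return all(o.min_ts > candidate.max_ts for o in overlaps)

old = SSTable(min_ts=0, max_ts=100, max_deletion_time=500)
# A read repair wrote old-timestamped data into a recent SSTable:
blocker = SSTable(min_ts=50, max_ts=900, max_deletion_time=2000)

print(fully_expired(old, [blocker], now=1000))                        # False: blocked
print(fully_expired(old, [blocker], now=1000, ignore_overlaps=True))  # True: dropped
```

With the option off, a single overlapping SSTable carrying an old minimum 
timestamp is enough to keep the fully expired one on disk indefinitely.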

The only possible zombie data in a correct TWCS use case (all data written 
with TTLs) arises when a tombstone and the data it shadows are written in the 
same time window (and, of course, the data is missing on one node).

If the data and the tombstone live in different buckets, the scenario is the 
following:
- data is written in bucket 1 with a TTL, but the write fails on one node
- the tombstone is written in bucket 2 on all nodes: data and tombstone will 
then never be compacted together since they live in different buckets
- during bucket 3, a read repair replicates the data (which should have been 
in bucket 1) to the node that missed it; it is written with the same 
timestamp/TTL and will expire at the same time as on all other nodes, even if 
the tombstone is collected first (which won't happen until the TTL expires)
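
The last step rests on one property, sketched below (a simplified model, not 
Cassandra internals): a read repair replays the cell with its original write 
timestamp and TTL, so the repaired copy's expiration moment is identical on 
every replica.

```python
def expiration_time(write_ts, ttl):
    # The cell's deletion time is fixed by the write timestamp and TTL
    # carried with the cell, not by when a replica happens to receive it.
    return write_ts + ttl

original_copy = expiration_time(write_ts=1_000, ttl=86_400)
repaired_copy = expiration_time(write_ts=1_000, ttl=86_400)  # replayed later, same metadata
print(original_copy == repaired_copy)  # → True: no extended lifetime for the repaired copy
```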

If the tombstone and the data it shadows live in the same bucket, and the TTL 
is longer than gc_grace_seconds, then reappearing data is indeed possible. But 
even then I'm not sure it could happen: during the bucket's major compaction, 
the data and tombstone would most likely be merged and only the tombstone 
would survive, preventing a subsequent read repair from replicating the data 
into later time windows. 
[~jjirsa] [~krummas]: I may be wrong here about the way compaction actually 
merges tombstones and data before gc_grace_seconds, so please correct me if 
necessary.

IMHO it is worth accepting, by choice, a slight chance of reappearing data in 
a TTL workload in order to allow optimal space savings.

After looking at your patch, it could be interesting performance-wise to skip 
calling getOverlappingSSTables() entirely, to avoid searching for and storing 
overlaps only to discard them afterwards, by modifying this line instead: 
https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/db/compaction/TimeWindowCompactionStrategy.java#L107
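
A hedged sketch of that suggestion (illustrative Python, not the actual Java 
in TimeWindowCompactionStrategy; all names are assumptions): when overlaps are 
to be ignored, never run the overlap search, instead of computing the 
overlapping set and then voiding it.

```python
class Tracker:
    """Stands in for the live-SSTable tracker; counts overlap searches."""
    def __init__(self, sstables):
        self.sstables = sstables
        self.overlap_calls = 0

    def get_overlapping(self, candidates):
        self.overlap_calls += 1  # the search cost we want to avoid paying
        return [s for s in self.sstables if s not in candidates]

def fully_expired(candidates, overlaps, now):
    return [c for c in candidates
            if c["max_deletion_time"] < now
            and all(o["min_ts"] > c["max_ts"] for o in overlaps)]

def expired_compute_then_discard(tracker, candidates, now, ignore_overlaps):
    overlaps = tracker.get_overlapping(candidates)  # always pays the search
    if ignore_overlaps:
        overlaps = []                               # ...only to void it
    return fully_expired(candidates, overlaps, now)

def expired_skip_search(tracker, candidates, now, ignore_overlaps):
    # Cheaper: skip the overlap search when the option says to ignore it.
    overlaps = [] if ignore_overlaps else tracker.get_overlapping(candidates)
    return fully_expired(candidates, overlaps, now)

old = {"min_ts": 0, "max_ts": 100, "max_deletion_time": 500}
blocker = {"min_ts": 50, "max_ts": 900, "max_deletion_time": 2000}

t = Tracker([old, blocker])
expired_skip_search(t, [old], now=1000, ignore_overlaps=True)
print(t.overlap_calls)  # → 0: the overlap search never runs
```

Both variants return the same expired set when the option is on; the second 
just avoids building a set it will never use.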

To sum up: big +1, it'll help ops who are fighting low disk space and don't 
understand why expired SSTables don't get deleted.

> Allow TWCS to ignore overlaps when dropping fully expired sstables
> ------------------------------------------------------------------
>
>                 Key: CASSANDRA-13418
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-13418
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Compaction
>            Reporter: Corentin Chary
>              Labels: twcs
>
> http://thelastpickle.com/blog/2016/12/08/TWCS-part1.html explains it well. If 
> you really want read-repairs you're going to have sstables blocking the 
> expiration of other fully expired SSTables because they overlap.
> You can set unchecked_tombstone_compaction = true or tombstone_threshold to a 
> very low value and that will purge the blockers of old data that should 
> already have expired, thus removing the overlaps and allowing the other 
> SSTables to expire.
> The thing is that this is rather CPU intensive and not optimal. If you have 
> time series, you might not care if all your data doesn't exactly expire at 
> the right time, or if data re-appears for some time, as long as it gets 
> deleted as soon as it can. And in this situation I believe it would be really 
> beneficial to allow users to simply ignore overlapping SSTables when looking 
> for fully expired ones.
> To the question: why would you need read-repairs?
> - Full repairs basically take longer than the TTL of the data on my dataset, 
> so this isn't really effective.
> - Even with a 10% chance of doing a repair, we found out that this would be 
> enough to greatly reduce entropy of the most used data (and if you have 
> timeseries, you're likely to have a dashboard doing the same important 
> queries over and over again).
> - LOCAL_QUORUM is too expensive (need >3 replicas), QUORUM is too slow.
> I'll try to come up with a patch demonstrating how this would work, try it on 
> our system and report the effects.
> cc: [~adejanovski], [~rgerard] as I know you worked on similar issues already.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
