Hi Eugene,

Common contributors to overlapping SSTables are:
1. Hints
2. Repairs
3. New writes with old timestamps (should be rare but technically possible)
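As an aside, overlap of this kind can be spotted from the min/max writetimes that `sstablemetadata` reports for each SSTable. A minimal sketch of the interval check (the file names and timestamps below are made up for illustration; in practice you would parse them from tool output):

```python
# Sketch: flag SSTables whose [min_writetime, max_writetime] ranges
# intersect. Names and timestamps are hypothetical examples.

def overlapping_pairs(sstables):
    """Return pairs of SSTable names whose timestamp ranges intersect."""
    items = sorted(sstables.items(), key=lambda kv: kv[1][0])  # by min ts
    pairs = []
    for i, (name_a, (min_a, max_a)) in enumerate(items):
        for name_b, (min_b, _) in items[i + 1:]:
            if min_b <= max_a:  # next table starts before this one ends
                pairs.append((name_a, name_b))
            else:
                break  # sorted by min ts, so no later table can overlap
    return pairs

tables = {
    "mc-1-big": (1000, 2000),  # writetime range of each SSTable
    "mc-2-big": (2100, 3000),
    "mc-3-big": (1500, 2600),  # straddles two windows -> overlaps both
}
print(overlapping_pairs(tables))
# -> [('mc-1-big', 'mc-3-big'), ('mc-3-big', 'mc-2-big')]
```

A per-key tool like the one described below refines this by telling you *which* keys cause the overlap, not just which files.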
I would not run repairs with TWCS - as you indicated, they result in
overlapping SSTables, which impacts disk space and read latency since reads
now have to touch multiple SSTables. As for
https://issues.apache.org/jira/browse/CASSANDRA-13418, I would not worry
about data resurrection as long as all writes carry a TTL. We faced similar
overlap issues with TWCS (in our case due to dclocal_read_repair_chance), so
we developed an SSTable tool that reports the topN or bottomN keys in an
SSTable by writetime/deletion time; we used it to identify the specific keys
responsible for the overlap between SSTables.

Thanks,
Sumanth

On Mon, Oct 9, 2017 at 6:36 PM, eugene miretsky <eugene.miret...@gmail.com> wrote:

> Thanks Alain!
>
> We are using TWCS compaction, and I read your blog multiple times - it
> was very useful, thanks!
>
> We are seeing a lot of overlapping SSTables, leading to a lot of
> problems: (a) a large number of tombstones read in queries, (b) high CPU
> usage, and (c) fairly long young-gen GC pauses (300ms).
>
> We have read_repair_chance = 0, unchecked_tombstone_compaction = true,
> and gc_grace_seconds = 3h, but we read and write with consistency = 1.
>
> I suspect the overlap is coming from either hinted handoff or a repair
> job we run nightly.
>
> 1) Is running repair with TWCS recommended? It seems like it will always
> create a never-ending overlap (the repair SSTable will have data from all
> 24 hours), an effect that seems to get amplified by anti-compaction.
> 2) TWCS seems to introduce a tradeoff between eventual consistency and
> write/read availability.
> If all repairs are turned off, then the choice is either (a) use a strong
> consistency level and pay the price of lower availability and slower
> reads and writes, or (b) use a lower consistency level and risk
> inconsistent data (the data is never repaired).
>
> I will try your last link, but reappearing data sounds a bit scary :)
>
> Any advice on how to debug this further would be greatly appreciated.
>
> Cheers,
> Eugene
>
> On Fri, Oct 6, 2017 at 11:02 AM, Alain RODRIGUEZ <arodr...@gmail.com>
> wrote:
>
>> Hi Eugene,
>>
>>> If we never use updates (time series data), is it safe to set
>>> gc_grace_seconds=0.
>>
>> As Kurt pointed out, you never want 'gc_grace_seconds' to be lower than
>> 'max_hint_window_in_ms', as the min of these two values is used as the
>> hint storage window in Apache Cassandra.
>>
>> Yet time series data with fixed TTLs allows a very efficient use of
>> Cassandra, especially when using Time Window Compaction Strategy (TWCS).
>> Fun fact: Jeff brought it to Apache Cassandra :-). I would definitely
>> give it a try.
>>
>> Here is a post from my colleague Alex that I believe could be useful in
>> your case: http://thelastpickle.com/blog/2016/12/08/TWCS-part1.html
>>
>> Using TWCS and lowering 'gc_grace_seconds' to the value of
>> 'max_hint_window_in_ms' should be really effective. Make sure to use a
>> strong consistency level (generally RF = 3, CL.Read = CL.Write =
>> LOCAL_QUORUM) to prevent inconsistencies, I would say (depending on your
>> interest in consistency).
>>
>> This way you could expire entire SSTables without compaction. If
>> overlaps in SSTables become a problem, you could even consider giving a
>> try to a more aggressive SSTable expiration:
>> https://issues.apache.org/jira/browse/CASSANDRA-13418.
>>
>> C*heers,
>> -----------------------
>> Alain Rodriguez - @arodream - al...@thelastpickle.com
>> France / Spain
>>
>> The Last Pickle - Apache Cassandra Consulting
>> http://www.thelastpickle.com
>>
>>
>> 2017-10-05 23:44 GMT+01:00 kurt greaves <k...@instaclustr.com>:
>>
>>> No, it's never safe to set it to 0, as you'll disable hinted handoff
>>> for the table. If you never do updates or manual deletes and you always
>>> insert with a TTL, you can get away with setting it to the hinted
>>> handoff period.
>>>
>>> On 6 Oct. 2017 1:28 am, "eugene miretsky" <eugene.miret...@gmail.com>
>>> wrote:
>>>
>>>> Thanks Jeff,
>>>>
>>>> Makes sense.
>>>> If we never use updates (time series data), is it safe to set
>>>> gc_grace_seconds=0?
>>>>
>>>> On Wed, Oct 4, 2017 at 5:59 PM, Jeff Jirsa <jji...@gmail.com> wrote:
>>>>
>>>>> The TTL'd cell is treated as a tombstone. gc_grace_seconds applies to
>>>>> TTL'd cells because, even though the data is TTL'd, it may have been
>>>>> written on top of another live cell that wasn't TTL'd.
>>>>>
>>>>> Imagine a test table, a simple key->value (k, v):
>>>>>
>>>>> INSERT INTO table(k,v) VALUES (1,1);
>>>>> Kill 1 of the 3 nodes.
>>>>> UPDATE table USING TTL 60 SET v=1 WHERE k=1;
>>>>>
>>>>> 60 seconds later, the live nodes will see that data as deleted, but
>>>>> when that dead node comes back to life, it needs to learn of the
>>>>> deletion.
>>>>>
>>>>> On Wed, Oct 4, 2017 at 2:05 PM, eugene miretsky <
>>>>> eugene.miret...@gmail.com> wrote:
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> The following link says that TTLs generate tombstones -
>>>>>> https://docs.datastax.com/en/cql/3.3/cql/cql_using/useExpire.html.
>>>>>>
>>>>>> What exactly is the process that converts the TTL into a tombstone?
>>>>>>
>>>>>> 1. Is an actual new tombstone cell created when the TTL expires?
>>>>>> 2. Or is the TTLed cell treated as a tombstone?
>>>>>>
>>>>>> Also, does gc_grace_period have an effect on TTLed cells?
>>>>>> gc_grace_period is meant to protect against deleted data reappearing
>>>>>> if the tombstone is compacted away before all nodes have reached a
>>>>>> consistent state. However, since the TTL is stored in the cell (in
>>>>>> liveness_info), there is no way for the cell to reappear (the TTL
>>>>>> will still be there).
>>>>>>
>>>>>> Cheers,
>>>>>> Eugene
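Jeff's failure scenario above can be sketched as a toy model: a TTL'd update overwrites a live, non-TTL'd cell while one replica is down, and if the expired cell is purged before the dead replica returns, the old value wins on merge. This is an illustrative simulation only (the names and dict-based "cells" are invented for the sketch), not Cassandra's actual storage engine:

```python
# Toy model of resurrection when gc_grace is effectively 0.
# All names here are illustrative; real Cassandra merges cells per column
# with tombstones, not whole-row dicts.

def merge_replicas(replicas):
    """Merge replica states: the cell with the highest timestamp wins."""
    live = [c for c in replicas.values() if c is not None]
    return max(live, key=lambda c: c["ts"]) if live else None

# Three replicas hold the original, non-TTL'd write (v=1 at ts=1).
cell_v1 = {"v": 1, "ts": 1, "expired": False}
replicas = {"n1": cell_v1, "n2": cell_v1, "n3": cell_v1}

# n3 goes down; the TTL'd update (ts=2) lands only on n1 and n2.
cell_ttl = {"v": 1, "ts": 2, "expired": True}  # TTL elapsed -> acts as delete
replicas["n1"] = cell_ttl
replicas["n2"] = cell_ttl

# With no grace period, compaction purges the expired cell on the live
# nodes before n3 comes back:
replicas["n1"] = None
replicas["n2"] = None

# n3 returns still holding the old live cell -> the "deleted" value wins.
resurrected = merge_replicas(replicas)
print(resurrected)  # -> {'v': 1, 'ts': 1, 'expired': False}
```

With a gc_grace_seconds at least as long as the repair/hint window, the expired cell at ts=2 would still be present on n1/n2 and would shadow n3's stale cell instead.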