Anti-entropy repairs ("nodetool repair") and bootstrap/decom/removenode
should stream sections of (and/or possibly entire) sstables from one
replica to another. Assuming the original sstable was entirely contained in
a single time window, the resulting sstable fragment streamed to the
neighbor node will similarly be entirely contained within a single time
window, and will be joined with the sstables in that window. If you find
this isn't the case, open a JIRA, that's a bug (it was explicitly a design
goal of TWCS, as it was one of my biggest gripes with early versions of
DTCS).
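
If you want to verify this yourself, sstablemetadata prints each sstable's
minimum and maximum timestamps, and in a healthy TWCS table every file
should span at most one window. A rough sketch (the data directory and file
naming below are just the usual defaults, adjust for your install):

# point this at the table's data directory on one replica
for f in /var/lib/cassandra/data/my_ks/my_table-*/*-Data.db ; do
  echo "== $f"
  sstablemetadata "$f" | grep -i timestamp
done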

Read repairs, however, will pollute the memtable and cause overlaps. There
are two types of read repairs:
- Blocking read repair due to consistency level (if you read at quorum and
one of the replicas is missing data, the coordinator will issue mutations
to the missing replica, which go into the memtable and flush into the
newest time window). This cannot be disabled (period), and is probably the
reason most people have overlaps (people tend to read their writes pretty
quickly after writing in time series use cases, often before hints or
normal repair can run successfully, especially in environments where nodes
are bounced often).
- Background read repair (tunable with the read_repair_chance and
dclocal_read_repair_chance table options), which is like blocking read
repair but happens probabilistically (i.e. there's a 1% chance on any given
read that the coordinator will scan the partition and copy any missing data
to the replicas missing it; again, this goes to the memtable and will flush
into the newest time window). A sketch for turning this off per table
follows below.
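
The probabilistic kind can be switched off per table if it turns out to be
the source of your overlaps. A minimal sketch, assuming a hypothetical
table ks.events (the blocking kind above can't be disabled this way):

ALTER TABLE ks.events
  WITH read_repair_chance = 0.0
  AND dclocal_read_repair_chance = 0.0;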

There's a pretty good argument to be made against manual repairs if (and
only if) you only use TTLs, never explicitly delete data, and can tolerate
the business risk of losing two machines at a time. That is: in the very,
very rare case that you somehow lose 2 machines before you can rebuild,
you'll lose some subset of data that never made it to the sole remaining
replica; is your business going to lose millions of dollars, or will you
just have a gap in an analytics dashboard somewhere that nobody's going to
worry about?

- Jeff


On Wed, Oct 11, 2017 at 9:24 AM, Sumanth Pasupuleti <
spasupul...@netflix.com.invalid> wrote:

> Hi Eugene,
>
> Common contributors to overlapping SSTables are
> 1. Hints
> 2. Repairs
> 3. New writes with old timestamps (should be rare but technically possible)
>
> I would not run repairs with TWCS - as you indicated, it is going to
> result in overlapping SSTables which impacts disk space and read latency
> since reads now have to encompass multiple SSTables.
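>
> A quick way to see how bad it is on a given node is the per-read SSTable
> count from nodetool (a sketch; on older versions the command is
> cfhistograms rather than tablehistograms):
>
> nodetool tablehistograms my_keyspace my_table
>
> The "SSTables" column shows how many SSTables a read had to touch at each
> percentile.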
>
> As for https://issues.apache.org/jira/browse/CASSANDRA-13418, I would not
> worry about data resurrection as long as all the writes carry TTL with them.
>
> We faced similar overlapping issues with TWCS (it was due to
> dclocal_read_repair_chance) - we developed an SSTable tool that would give
> topN or bottomN keys in an SSTable based on writetime/deletion time - we
> used this to identify the specific keys responsible for overlap between
> SSTables.
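>
> If you don't want to build a tool, a rough approximation with the stock
> tooling is sstablemetadata (which prints each SSTable's min/max
> timestamps, so you can spot which files overlap) plus sstabledump, which
> dumps every partition and cell as JSON with its write time. A sketch (the
> file name is just an example):
>
> sstabledump /path/to/mc-42-big-Data.db > dump.json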
>
> Thanks,
> Sumanth
>
>
> On Mon, Oct 9, 2017 at 6:36 PM, eugene miretsky <eugene.miret...@gmail.com
> > wrote:
>
>> Thanks Alain!
>>
>> We are using TWCS compaction, and I read your blog multiple times - it
>> was very useful, thanks!
>>
>> We are seeing a lot of overlapping SSTables, leading to a lot of
>> problems: (a) a large number of tombstones read in queries, (b) high CPU
>> usage, (c) fairly long Young Gen GC pauses (300 ms).
>>
>> We have read_repair_chance = 0, unchecked_tombstone_compaction = true,
>> and gc_grace_seconds = 3h, but we read and write at consistency level
>> ONE.
>>
>> I'm suspecting the overlap is coming from either hinted handoff or a
>> repair job we run nightly.
>>
>> 1) Is running repair with TWCS recommended? It seems like it will always
>> create a never-ending overlap (the SSTable produced by repair will have data
>> from all 24 hours), an effect that seems to get amplified by anti-compaction.
>> 2) TWCS seems to introduce a tradeoff between eventual consistency and
>> write/read availability. If all repairs are turned off, then the choice is
>> either (a) use a strong consistency level and pay the price of lower
>> availability and slower reads and writes, or (b) use a lower consistency
>> level and risk inconsistent data (data is never repaired).
>>
>> I will try your last link, but reappearing data sounds a bit scary :)
>>
>> Any advice on how to debug this further would be greatly appreciated.
>>
>> Cheers,
>> Eugene
>>
>> On Fri, Oct 6, 2017 at 11:02 AM, Alain RODRIGUEZ <arodr...@gmail.com>
>> wrote:
>>
>>> Hi Eugene,
>>>
>>>> If we never use updates (time series data), is it safe to set
>>>> gc_grace_seconds=0?
>>>
>>>
>>> As Kurt pointed out, you never want 'gc_grace_seconds' to be lower than
>>> 'max_hint_window_in_ms', as the min of these two values is used as the
>>> hint storage window size in Apache Cassandra.
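>>>
>>> For reference, the stock default hint window is 3 hours (a sketch of the
>>> relevant cassandra.yaml line; the value below is the default):
>>>
>>> max_hint_window_in_ms: 10800000   # 3 hours, so keep gc_grace_seconds >= 10800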
>>>
>>> That said, time series data with fixed TTLs allows a very efficient use of
>>> Cassandra, especially when using Time Window Compaction Strategy (TWCS).
>>> Fun fact: Jeff is the one who brought it to Apache Cassandra :-). I would
>>> definitely give it a try.
>>>
>>> Here is a post from my colleague Alex that I believe could be useful in
>>> your case: http://thelastpickle.com/blog/2016/12/08/TWCS-part1.html
>>>
>>> Using TWCS and lowering 'gc_grace_seconds' to the value of
>>> 'max_hint_window_in_ms' should be really effective. Make sure to use a
>>> strong consistency level (generally RF = 3, CL.Read = CL.Write =
>>> LOCAL_QUORUM) to prevent inconsistencies, depending on how much your use
>>> case cares about consistency.
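>>>
>>> Something along these lines, as a sketch only (table name, window size and
>>> TTL are placeholders to adapt to your data model):
>>>
>>> -- 1 day windows, 30 day TTL, gc_grace_seconds = 3h hint window
>>> ALTER TABLE ks.events WITH
>>>   compaction = {'class': 'TimeWindowCompactionStrategy',
>>>                 'compaction_window_unit': 'DAYS',
>>>                 'compaction_window_size': '1'}
>>>   AND default_time_to_live = 2592000
>>>   AND gc_grace_seconds = 10800;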
>>>
>>> This way you can expire entire SSTables without compaction. If overlaps
>>> in SSTables become a problem, you could even consider giving more
>>> aggressive SSTable expiration a try:
>>> https://issues.apache.org/jira/browse/CASSANDRA-13418.
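>>>
>>> If memory serves, the patch on that ticket adds an
>>> 'unsafe_aggressive_sstable_expiration' compaction option to TWCS, guarded
>>> by a JVM flag at startup; please double check the exact names against the
>>> ticket for your version, this is from memory:
>>>
>>> ALTER TABLE ks.events WITH
>>>   compaction = {'class': 'TimeWindowCompactionStrategy',
>>>                 'unsafe_aggressive_sstable_expiration': 'true'};
>>> -- plus something like -Dcassandra.allow_unsafe_aggressive_sstable_expiration=true
>>> -- set in jvm.options / cassandra-env.sh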
>>>
>>> C*heers,
>>> -----------------------
>>> Alain Rodriguez - @arodream - al...@thelastpickle.com
>>> France / Spain
>>>
>>> The Last Pickle - Apache Cassandra Consulting
>>> http://www.thelastpickle.com
>>>
>>>
>>>
>>> 2017-10-05 23:44 GMT+01:00 kurt greaves <k...@instaclustr.com>:
>>>
>>>> No, it's never safe to set it to 0, as you'll disable hinted handoff for
>>>> the table. If you never do updates or manual deletes and you always
>>>> insert with a TTL, you can get away with setting it to the hinted handoff
>>>> period.
>>>>
>>>> On 6 Oct. 2017 1:28 am, "eugene miretsky" <eugene.miret...@gmail.com>
>>>> wrote:
>>>>
>>>>> Thanks Jeff,
>>>>>
>>>>> Makes sense.
>>>>> If we never use updates (time series data), is it safe to set
>>>>> gc_grace_seconds=0?
>>>>>
>>>>>
>>>>>
>>>>> On Wed, Oct 4, 2017 at 5:59 PM, Jeff Jirsa <jji...@gmail.com> wrote:
>>>>>
>>>>>>
>>>>>> The TTL'd cell is treated as a tombstone. gc_grace_seconds applies to
>>>>>> TTL'd cells, because even though the data is TTL'd, it may have been
>>>>>> written on top of another live cell that wasn't ttl'd:
>>>>>>
>>>>>> Imagine a test table, simple key->value (k, v).
>>>>>>
>>>>>> INSERT INTO table(k,v) values(1,1);
>>>>>> Kill 1 of the 3 nodes
>>>>>> UPDATE table USING TTL 60 SET v=1 WHERE k=1 ;
>>>>>> 60 seconds later, the live nodes will see that data as deleted, but
>>>>>> when that dead node comes back to life, it needs to learn of the 
>>>>>> deletion.
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Wed, Oct 4, 2017 at 2:05 PM, eugene miretsky <
>>>>>> eugene.miret...@gmail.com> wrote:
>>>>>>
>>>>>>> Hello,
>>>>>>>
>>>>>>> The following link says that TTLs generate tombstones -
>>>>>>> https://docs.datastax.com/en/cql/3.3/cql/cql_using/useExpire.html.
>>>>>>>
>>>>>>> What exactly is the process that converts the TTL into a tombstone?
>>>>>>>
>>>>>>>    1. Is an actual new tombstone cell created when the TTL expires?
>>>>>>>    2. Or, is the TTLed cell treated as a tombstone?
>>>>>>>
>>>>>>>
>>>>>>> Also, does gc_grace_seconds have an effect on TTLed cells?
>>>>>>> gc_grace_seconds is meant to protect against deleted data re-appearing
>>>>>>> if the tombstone is compacted away before all nodes have reached a
>>>>>>> consistent state. However, since the ttl is stored in the cell (in
>>>>>>> liveness_info), there is no way for the cell to re-appear (the ttl will
>>>>>>> still be there).
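>>>>>>>
>>>>>>> For what it's worth, sstabledump on a 3.x SSTable shows the TTL right in
>>>>>>> the row's liveness_info, something like this (field names from memory,
>>>>>>> output trimmed):
>>>>>>>
>>>>>>> "liveness_info" : { "tstamp" : "2017-10-04T21:05:00Z", "ttl" : 60,
>>>>>>>                     "expires_at" : "2017-10-04T21:06:00Z", "expired" : true }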
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Eugene
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>
>>
>
