The most likely cause is read repair triggered by digest mismatches at your
read consistency level. That kind of read repair happens on any read at a CL
above ONE, regardless of read_repair_chance. The only way to actually
eliminate it is to read at CL:ONE, which almost nobody does (at least in time
series use cases, because it implies you either write at ALL, or run repair,
which - as you've noted - often isn't necessary in TTL-only use cases).
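
A minimal sketch of that read path with the Python driver, assuming a made-up
keyspace, table, and query; the point is just that at ConsistencyLevel.ONE only
one replica answers, so there is no digest comparison that can mismatch:

    # Read at CL:ONE so a single replica serves the read and no digest
    # request is sent to other replicas (hence no read repair on mismatch).
    from datetime import datetime, timedelta

    from cassandra import ConsistencyLevel
    from cassandra.cluster import Cluster
    from cassandra.query import SimpleStatement

    session = Cluster(["127.0.0.1"]).connect("metrics")  # hypothetical keyspace

    query = SimpleStatement(
        "SELECT ts, value FROM sensor_readings WHERE sensor_id = %s AND ts > %s",
        consistency_level=ConsistencyLevel.ONE,
    )
    rows = session.execute(query, ("sensor-42", datetime.utcnow() - timedelta(days=1)))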

I can't see the image, but more tools for understanding sstable state are
never a bad thing (as long as they're generally useful and maintainable).

For what it's worth, there are tickets in flight to be more aggressive about
dropping overlapping, fully expired sstables. In the meantime, there are
companies that use tooling that stops the cluster, runs sstablemetadata to
identify the sstables that should already be fully expired, and manually
removes them (/bin/rm) before starting Cassandra again. It works reasonably
well IF (and only if) you write all of your data with TTLs and you can
identify fully expired sstables based on their maximum timestamps.
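
A rough sketch of that identification step, meant to run only while the node
is stopped. It assumes sstablemetadata is on the PATH and prints a "Maximum
timestamp" line in microseconds (as 2.1 does); the data path, keyspace/table,
and TTL/gc_grace values are placeholders to adjust before trusting its output:

    # Flag sstables whose newest cell was written before (now - TTL - gc_grace),
    # i.e. sstables that should already be fully expired.
    import glob
    import re
    import subprocess
    import time

    TTL_SECONDS = 15 * 86400       # table-level TTL
    GC_GRACE_SECONDS = 3600        # gc_grace_seconds
    DATA_FILES = "/var/lib/cassandra/data/metrics/sensor_readings-*/*-Data.db"

    cutoff_micros = (time.time() - TTL_SECONDS - GC_GRACE_SECONDS) * 1000000

    for path in sorted(glob.glob(DATA_FILES)):
        out = subprocess.check_output(["sstablemetadata", path]).decode()
        match = re.search(r"Maximum timestamp:\s*(\d+)", out)
        if match and int(match.group(1)) < cutoff_micros:
            # Candidate for manual removal (rm) while Cassandra is stopped.
            print("fully expired candidate:", path)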




On Tue, Aug 8, 2017 at 8:51 PM, Sumanth Pasupuleti <
sumanth.pasupuleti...@gmail.com> wrote:

>> Hi,
>>
>> We use TWCS in a few of the column families that hold TTL-based
>> time-series data, and no explicit deletes are issued. Over time, we have
>> observed disk usage increasing beyond the expected levels.
>>
>> The data directory on a particular node shows SSTables that are more than
>> 16 days old, while the bucket size is configured at 12 hours, the TTL at
>> 15 days, and GC grace at 1 hour.
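
For reference, a minimal sketch (Python driver) of table options consistent
with the settings described above: 12-hour TWCS windows, a 15-day default TTL,
1-hour gc_grace, and read repair chance pinned to 0. Keyspace, table, and
column names are made up, and the short class name assumes the bundled TWCS;
a 2.1 cluster would instead need the external TWCS jar's fully-qualified class
name.

    from cassandra.cluster import Cluster

    session = Cluster(["127.0.0.1"]).connect("metrics")  # hypothetical keyspace

    # 1296000 s = 15 days; 3600 s = 1 hour.
    session.execute("""
        CREATE TABLE IF NOT EXISTS sensor_readings (
            sensor_id text,
            ts        timestamp,
            value     double,
            PRIMARY KEY (sensor_id, ts)
        ) WITH compaction = {
                'class': 'TimeWindowCompactionStrategy',
                'compaction_window_unit': 'HOURS',
                'compaction_window_size': '12'
            }
            AND default_time_to_live = 1296000
            AND gc_grace_seconds = 3600
            AND read_repair_chance = 0.0
            AND dclocal_read_repair_chance = 0.0
    """)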
>> Running sstableexpiredblockers, we got quite a few sets of blocking and
>> blocked SSTables. The SSTable metadata shown in its output indicates an
>> overlap between the min/max timestamp ranges of the blocking SSTable and
>> the blocked SSTables, which is what prevents the older SSTables from being
>> dropped/deleted.
>>
>> Following are the possible root causes we considered
>>
>>    1. Hints - old data hints getting replayed from the coordinator node.
>>    We ruled this out since hints live for no more than 1 day based on our
>>    configuration.
>>    2. External compactions - no external compactions were run, that
>>    could cause compaction of SSTables across the TWCS buckets.
>>    3.  Read repairs - this is ruled out as well, since we never ran
>>    external repairs, and read repair chance on the TWCS column families has
>>    been set to 0.
>>    4.  Application team writing data with older timestamp (in newer
>>    SSTables).
>>
>>
>>    To investigate #4 further:
>>
>>    1. We wanted to identify the specific row keys with older timestamps in
>>    the blocking SSTable that could be causing this issue to occur. We
>>    considered using sstablekeys/sstable2json; however, since both tools
>>    output the entire contents/keys of the SSTable in key order, they were
>>    not helpful in this case.
>>    2. Since we wanted data on the few cells with the oldest timestamps, we
>>    created a tool, based mostly on sstable2json, called sstableslicer,
>>    which outputs the 'n' top/bottom cells in an SSTable, ordered on either
>>    writetime or localDeletionTime (a rough sketch of the idea appears after
>>    the sample output below). This helped us identify the specific cells in
>>    new SSTables with older timestamps, which in turn helped debugging on
>>    the application end. From the application team's perspective, however,
>>    writing data with old timestamps is not a possible scenario.
>>    3. Below is a sample output of sstableslicer:
>> [image: sample sstableslicer output]
>>
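
Not the actual sstableslicer patch, but a rough post-processing sketch of the
same idea for anyone without the tool: dump an sstable with sstable2json and
report the n cells with the oldest write timestamps. It assumes 2.1-era
sstable2json encodes each cell roughly as [name, value, timestamp, ...] with
microsecond timestamps, which is worth verifying against your version.

    import json
    import subprocess
    import sys

    def oldest_cells(data_file, n=10):
        rows = json.loads(subprocess.check_output(["sstable2json", data_file]))
        cells = []
        for row in rows:
            for cell in row.get("cells", []):
                if len(cell) >= 3:
                    # (write timestamp, partition key, cell name)
                    cells.append((cell[2], row["key"], cell[0]))
        cells.sort()  # ascending write timestamp, oldest first
        return cells[:n]

    if __name__ == "__main__":
        for ts, key, name in oldest_cells(sys.argv[1]):
            print(ts, key, name)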
>> Looking for suggestions, especially around the following two things:
>>
>>    1. Did we miss any other case in TWCS that could be causing such an
>>    overlap?
>>    2. Does sstableslicer seem valuable enough to be included in Apache C*?
>>    If yes, I shall create a JIRA and submit a PR/patch for review.
>>
>> The C* version we use is 2.1.17.
>>
>> Thanks,
>> Sumanth
>>
