Give your use case some deep thought. Different user tables/types may need different purge strategies, based on how frequently each account type is usually accessed, what the user count is for each type, and so on.

Thanks
Anuj
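P.S. For example, per-type retention could be expressed directly as table-level default TTLs, which Cassandra 2.0 supports. The tables, columns and values below are made up purely for illustration:

  -- Guest accounts are rarely revisited, so purge them aggressively.
  CREATE TABLE guest_users (
      user_id uuid PRIMARY KEY,
      email text
  ) WITH default_time_to_live = 2592000;   -- 30 days

  -- Registered accounts are long-lived, so retain them much longer.
  CREATE TABLE registered_users (
      user_id uuid PRIMARY KEY,
      email text
  ) WITH default_time_to_live = 15552000;  -- ~6 months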
On Fri, 22 Jan, 2016 at 11:37 pm, Anuj Wadehra <anujw_2...@yahoo.co.in> wrote:

Hi Joseph,

I am personally in favour of the second approach, because I don't want to do a lot of IO just because a user accesses the site several times a day.

Options I see:

1. If you are on SSDs, test LCS and update the TTL of all columns on each access, so that the system can tolerate the extra IO.
Advantages: No scheduled job needed; deletion is seamless; better read performance than STCS.
Disadvantages: Reinserting records with a new TTL means a read before each write, which is an anti-pattern and slow. Active users cause unnecessary IO just to refresh the TTL, and LCS adds high IO of its own.

2. Create a new table keyed on user id that stores the last access time, instead of relying on the built-in secondary indexes. Overwrite the last access time on each access. Schedule a job to read this table at regular intervals, maybe once a week, and manually delete users from the main table based on their last access time. You can test LCS with the new table as well.
Advantages: Lightweight writes for updating the access time; flexibility to change the deletion logic later.
Disadvantages: A scheduled job and the supporting code need to be implemented, and the scheduler needs a slow full scan of the table to read the last access times. Full table scans can be parallelised via token-based CQL queries for better performance, and an Apache Spark job that finds the users to be purged would do this at tremendous speed.

Secondary indexes are not suitable here and don't scale well; I would suggest dropping them. Rough CQL sketches of both options follow at the bottom of this mail, below Joseph's original question.

Thanks
Anuj

On Tue, 22 Dec, 2015 at 3:06 pm, jaalex.tech <jaalex.t...@gmail.com> wrote:

Hi,

I'm looking for suggestions/caveats on using TTL as a substitute for a manual data purge job.

We have a few tables that hold user information - they could be guest or registered users, and there could be between 500K and 1M records created per day per table. Currently, these tables have a secondary-indexed updated_date column which is populated on each update. However, we have been getting timeouts when querying on updated_date once the number of records grows, so I don't think this will be a reliable option in the long term when we need to purge records that have not been used for the last X days.

In this scenario, is it advisable to include a high enough TTL (i.e. the amount of time we want the records to last, which could be 3 to 6 months) when inserting/updating records? There could be cases where the TTL gets reset after a couple of days/weeks, when the user visits the site again.

The tables have a fixed number of columns, except for one which has a clustering key and may have at most 10 entries per partition key.

I need to know the overhead of having so many rows with TTLs hanging around for a relatively long duration (weeks/months), and the impact this could have on performance/storage. If this is not a recommended approach, what would be an alternative design for a manual purge job that does not use secondary indices?

We are using Cassandra 2.0.x.

Thanks,
Joseph
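Rough sketch of option 1 in CQL (all table and column names here are invented for illustration; adapt them to your real schema):

  -- Main users table on LCS, so the constant row rewrites compact away quickly.
  CREATE TABLE users (
      user_id uuid PRIMARY KEY,
      email text,
      name text
  ) WITH compaction = { 'class' : 'LeveledCompactionStrategy' };

  -- On each access, rewrite the full row with a fresh TTL (~90 days here).
  -- This is the read-before-write step: the current column values must be
  -- SELECTed first, because a TTL applies only to the columns you write.
  INSERT INTO users (user_id, email, name)
  VALUES (?, ?, ?)
  USING TTL 7776000;

Note there is no single statement that just bumps the TTL of an existing row; refreshing it always means rewriting every column, which is exactly the extra IO described above.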
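Rough sketch of option 2 in CQL (same caveat: names are illustrative):

  -- Companion table: one lightweight overwrite per access, no read needed.
  CREATE TABLE user_last_access (
      user_id uuid PRIMARY KEY,
      last_access timestamp
  );

  -- On each access:
  UPDATE user_last_access SET last_access = ? WHERE user_id = ?;

  -- Weekly purge job: each worker scans its own slice of the token ring;
  -- this is the shape of the token-based parallel CQL queries mentioned above.
  SELECT user_id, last_access
  FROM user_last_access
  WHERE token(user_id) > ? AND token(user_id) <= ?;

  -- For every row whose last_access is older than the cutoff, delete from
  -- both the main table and the companion table.
  DELETE FROM users WHERE user_id = ?;
  DELETE FROM user_last_access WHERE user_id = ?;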