Give your use case some deep thought. Different user tables/types may need different purge strategies, based on how frequently each account type is usually accessed, what the user count is for each type, and so on.

Thanks
Anuj
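P.S. For example, per-type retention could be expressed directly as table-level default TTLs, which Cassandra 2.0 supports. The tables, columns and values below are made up purely for illustration:

  -- Guest accounts are rarely revisited, so purge them aggressively.
  CREATE TABLE guest_users (
      user_id uuid PRIMARY KEY,
      email text
  ) WITH default_time_to_live = 2592000;   -- 30 days

  -- Registered accounts are long-lived, so retain them much longer.
  CREATE TABLE registered_users (
      user_id uuid PRIMARY KEY,
      email text
  ) WITH default_time_to_live = 15552000;  -- ~6 months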
On Fri, 22 Jan, 2016 at 11:37 pm, Anuj Wadehra <anujw_2...@yahoo.co.in> wrote:

Hi Joseph,

I am personally in favour of the second approach, because I don't want to do a lot of IO just because a user accesses the site several times a day.

Options I see:

1. If you are on SSDs, test LCS and update the TTL of all columns on each access, so that the system can tolerate the extra IO.
Advantages: No scheduled job needed; deletion is seamless; better read performance than STCS.
Disadvantages: Reinserting records with a new TTL means a read before each write, which is an anti-pattern and slow. Active users cause unnecessary IO just to refresh the TTL, and LCS adds high IO of its own.

2. Create a new table keyed on user id that stores the last access time, instead of relying on the built-in secondary indexes. Overwrite the last access time on each access. Schedule a job to read this table at regular intervals, maybe once a week, and manually delete users from the main table based on their last access time. You can test LCS with the new table as well.
Advantages: Lightweight writes for updating the access time; flexibility to change the deletion logic later.
Disadvantages: A scheduled job and the supporting code need to be implemented, and the scheduler needs a slow full scan of the table to read the last access times. Full table scans can be parallelised via token-based CQL queries for better performance, and an Apache Spark job that finds the users to be purged would do this at tremendous speed.

Secondary indexes are not suitable here and don't scale well; I would suggest dropping them. Rough CQL sketches of both options follow at the bottom of this mail, below Joseph's original question.

Thanks
Anuj

On Tue, 22 Dec, 2015 at 3:06 pm, jaalex.tech <jaalex.t...@gmail.com> wrote:

Hi,

I'm looking for suggestions/caveats on using TTL as a substitute for a manual data purge job.

We have a few tables that hold user information - they could be guest or registered users, and there could be between 500K and 1M records created per day per table. Currently, these tables have a secondary-indexed updated_date column which is populated on each update. However, we have been getting timeouts when querying on updated_date once the number of records grows, so I don't think this will be a reliable option in the long term when we need to purge records that have not been used for the last X days.

In this scenario, is it advisable to include a high enough TTL (i.e. the amount of time we want the records to last, which could be 3 to 6 months) when inserting/updating records? There could be cases where the TTL gets reset after a couple of days/weeks, when the user visits the site again.

The tables have a fixed number of columns, except for one which has a clustering key and may have at most 10 entries per partition key.

I need to know the overhead of having so many rows with TTLs hanging around for a relatively long duration (weeks/months), and the impact this could have on performance/storage. If this is not a recommended approach, what would be an alternative design for a manual purge job that does not use secondary indices?

We are using Cassandra 2.0.x.

Thanks,
Joseph
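Rough sketch of option 1 in CQL (all table and column names here are invented for illustration; adapt them to your real schema):

  -- Main users table on LCS, so the constant row rewrites compact away quickly.
  CREATE TABLE users (
      user_id uuid PRIMARY KEY,
      email text,
      name text
  ) WITH compaction = { 'class' : 'LeveledCompactionStrategy' };

  -- On each access, rewrite the full row with a fresh TTL (~90 days here).
  -- This is the read-before-write step: the current column values must be
  -- SELECTed first, because a TTL applies only to the columns you write.
  INSERT INTO users (user_id, email, name)
  VALUES (?, ?, ?)
  USING TTL 7776000;

Note there is no single statement that just bumps the TTL of an existing row; refreshing it always means rewriting every column, which is exactly the extra IO described above.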
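Rough sketch of option 2 in CQL (same caveat: names are illustrative):

  -- Companion table: one lightweight overwrite per access, no read needed.
  CREATE TABLE user_last_access (
      user_id uuid PRIMARY KEY,
      last_access timestamp
  );

  -- On each access:
  UPDATE user_last_access SET last_access = ? WHERE user_id = ?;

  -- Weekly purge job: each worker scans its own slice of the token ring;
  -- this is the shape of the token-based parallel CQL queries mentioned above.
  SELECT user_id, last_access
  FROM user_last_access
  WHERE token(user_id) > ? AND token(user_id) <= ?;

  -- For every row whose last_access is older than the cutoff, delete from
  -- both the main table and the companion table.
  DELETE FROM users WHERE user_id = ?;
  DELETE FROM user_last_access WHERE user_id = ?;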