On 12 March 2010 03:34, Bill Au <bill.w...@gmail.com> wrote:
> Let's take Twitter as an example. All the tweets are timestamped. I want to
> keep only a month's worth of tweets for each user. The number of tweets
> that fit within this one month window varies from user to user. What is the
> best way to accomplish this?

This is the "expiry" problem that has been discussed on this list before. As far as I can see, there is no easy way to do it with 0.5.

If you use the ordered partitioner and make the first part of the key a timestamp (or part of one), then you can range scan to get the keys and delete them. However, these deletes will be quite inefficient: currently each row must be deleted individually (there was a patch for range delete kicking around; I don't know if it has been accepted yet). But even if range delete is implemented, it's still quite inefficient and not really what you want, and it doesn't work with the RandomPartitioner.

If you keep some metadata saying who tweeted within a given period (say 10 days or 30 days), and you store all of a user's tweets for a period under the same key (say with one column per tweet, or using supercolumns), then you can just delete one key per user per period.

One of the problems with using a time-based key with the ordered partitioner is that you're always going to have a data imbalance, so you may want to try hashing *part* of the key (the first part) so that you can still range scan on the next part. This may fix load balancing while still enabling you to use range scans to do data expiry.

e.g. your key is

Hash of day number + user id + timestamp

Then you can range scan the entire day's tweets to expire them, and range scan a given user's tweets for a given day efficiently (doing this for 30 days is just 30 range scans).

Putting a hash in there fixes load balancing with OPP.

Mark
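
To make the "hash of day number + user id + timestamp" key layout concrete, here is a minimal Python sketch. It does not use any Cassandra client API: a sorted list of strings stands in for the key order you would get from the ordered partitioner. The function names, the ":" separator, the use of MD5 and the zero-padded timestamp are all assumptions made for illustration, and user ids are assumed not to contain ":".

```python
import hashlib
from bisect import bisect_left, bisect_right

SECONDS_PER_DAY = 86400


def day_number(ts: int) -> int:
    """Days since the Unix epoch for a timestamp given in seconds."""
    return ts // SECONDS_PER_DAY


def day_prefix(day: int) -> str:
    # Hash only the day number, so that consecutive days land in
    # unrelated parts of the key space instead of piling up on one node.
    return hashlib.md5(str(day).encode("ascii")).hexdigest()


def make_key(user_id: str, ts: int) -> str:
    # Hypothetical layout: <hash of day>:<user id>:<zero-padded timestamp>.
    # Zero padding keeps string order equal to numeric order.
    return f"{day_prefix(day_number(ts))}:{user_id}:{ts:010d}"


def scan(sorted_keys, day, user_id=None):
    """Range scan for one day, optionally narrowed to a single user."""
    start = day_prefix(day) + ":"
    if user_id is not None:
        start += user_id + ":"
    end = start + "\uffff"  # sorts after every key sharing this prefix
    lo = bisect_left(sorted_keys, start)
    hi = bisect_right(sorted_keys, end)
    return sorted_keys[lo:hi]


def expire_day(sorted_keys, day):
    """Return the keys that remain after dropping one whole day."""
    doomed = set(scan(sorted_keys, day))
    return [k for k in sorted_keys if k not in doomed]
```

Reading a user's last 30 days is then 30 calls to scan(keys, day, user_id), one per day number, and expiring old data is one scan per expired day followed by deleting whatever keys it returns.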