On 12 March 2010 03:34, Bill Au <bill.w...@gmail.com> wrote:

> Let's take Twitter as an example.  All the tweets are timestamped.  I want to
> keep only a month's worth of tweets for each user.  The number of tweets
> that fit within this one month window varies from user to user.  What is the
> best way to accomplish this?

This is the "expiry" problem that has been discussed on this list before. As
far as I can see, there is no easy way to do it with 0.5.

If you use the ordered partitioner and make the first part of the keys a
timestamp (or part of it), then you can range scan for the old keys and
delete them.

However, these deletes will be quite inefficient. Currently each row must be
deleted individually (there was a patch for range delete kicking around; I
don't know if it has been accepted yet).

But even if range delete is implemented, it's still quite inefficient and
not really what you want, and it doesn't work with the RandomPartitioner.

If you have some metadata to say who tweeted within a given period (say 10
days or 30 days) and you store the tweets all in the same key per user per
period (say with one column per tweet, or use supercolumns), then you can
just delete one key per user per period.
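
A rough sketch of the bucketing idea, in Python and independent of any
Cassandra client. The key format "user:period", the 10-day period and all
function names are just my assumptions for illustration; active_periods
stands in for the metadata that says in which periods a user tweeted.

```python
from datetime import datetime, timedelta, timezone

PERIOD_DAYS = 10  # assumed bucket length; could equally be 30
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def period_number(ts):
    """Index of the fixed-length period that contains ts (UTC)."""
    return (ts - EPOCH).days // PERIOD_DAYS


def bucket_key(user_id, ts):
    """Row key holding all of one user's tweets for one period."""
    return "%s:%d" % (user_id, period_number(ts))


def expired_bucket_keys(user_id, active_periods, now, keep_days=30):
    """Keys that are safe to delete for this user.

    active_periods is the (hypothetical) metadata: the period numbers in
    which the user tweeted. Only periods that ended before the period
    containing the cutoff are returned, so nothing newer than keep_days
    is ever deleted.
    """
    cutoff = period_number(now - timedelta(days=keep_days))
    return ["%s:%d" % (user_id, p) for p in sorted(active_periods) if p < cutoff]
```

Each tweet is written as a column under bucket_key(user, tweet_time), and
expiry is one row delete per key returned by expired_bucket_keys. The
period that straddles the cutoff is kept whole, so a user can have up to
one extra period of old tweets until that period ages out.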

One of the problems with using a time-based key with the ordered partitioner
is that you're always going to have a data imbalance, because all new writes
land in the same part of the key range. So you may want to try hashing *part*
of the key (the first part) so you can still range scan the next part. This
may fix load balancing while still enabling you to use range scans to do data
expiry.

e.g. your key is

Hash of day number + user id + timestamp

Then you can range scan the entire day's tweets to expire them, and range
scan a given user's tweets for a given day efficiently (and doing this for
30 days is just 30 range scans).

Putting a hash in there fixes load balancing with OPP.
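
A minimal sketch of that key layout, again in plain Python with no Cassandra
calls. The choice of MD5, the ":" separator, the zero-padded timestamp and
the function names are my assumptions, and it assumes user ids never contain
":" and that keys sort byte-wise under the ordered partitioner.

```python
import hashlib

SECONDS_PER_DAY = 86400


def day_number(ts_seconds):
    """Days since the Unix epoch for a timestamp given in seconds."""
    return ts_seconds // SECONDS_PER_DAY


def day_prefix(day):
    """Hash only the day number, so consecutive days scatter over the ring."""
    return hashlib.md5(str(day).encode("ascii")).hexdigest()


def tweet_key(user_id, ts_seconds):
    """hash(day) + user id + timestamp, joined with ':'."""
    return "%s:%s:%010d" % (day_prefix(day_number(ts_seconds)), user_id, ts_seconds)


def day_range(day):
    """[start, end) bounds covering every key written on the given day."""
    prefix = day_prefix(day)
    return prefix + ":", prefix + ";"  # ';' is the character after ':' in ASCII


def user_day_range(user_id, day):
    """[start, end) bounds covering one user's keys on the given day."""
    prefix = "%s:%s" % (day_prefix(day), user_id)
    return prefix + ":", prefix + ";"
```

To expire day N you scan day_range(N) and delete what comes back; to read a
user's last month you run user_day_range once for each of the 30 days.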

