Jonathan, wouldn't using Long values as the column names for the 3rd CF cause conflicts if two users liked the same number of items? (The row would only keep one user for any given count.)
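To make the concern concrete, here is a minimal sketch that models the "topusers" row as a plain Python dict (column name -> column value). This is not Cassandra client code; the function names and user ids are made up for illustration. The second half shows one possible workaround, folding the user id into the column name, which is my own assumption and not something proposed in the thread.

```python
# Model one Cassandra row as a dict: column name -> column value.
# Writing a column whose name already exists replaces the old value.
def record_like_count(row, like_count, user_id):
    row[like_count] = user_id

topusers = {}
record_like_count(topusers, 42, "alice")
record_like_count(topusers, 42, "bob")  # same count, so same column name
# topusers is now {42: "bob"}; alice has been overwritten.

# Hypothetical workaround (an assumption, not from the thread):
# use a (count, user_id) pair as the column name so names stay unique.
def record_like_count_composite(row, like_count, user_id):
    row[(like_count, user_id)] = ""  # the value can stay empty

topusers_composite = {}
record_like_count_composite(topusers_composite, 42, "alice")
record_like_count_composite(topusers_composite, 42, "bob")
record_like_count_composite(topusers_composite, 17, "carol")

# Sorting the names in descending order yields the top N, ties included.
top_n = sorted(topusers_composite, reverse=True)[:2]
# top_n is [(42, "bob"), (42, "alice")]
```

With the composite name, a user's old column would still have to be deleted whenever their count changes, so the per-insert update cost mentioned in the quoted message remains.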
was thinking about this same problem (sorted lists of top N user activity) and thought that was a roadblock for that design.

-keith

On Mon, Mar 8, 2010 at 7:33 PM, Jonathan Ellis <jbel...@gmail.com> wrote:

> On Mon, Mar 8, 2010 at 6:18 AM, Matteo Caprari <matteo.capr...@gmail.com> wrote:
> > The 'key' queries are:
>
> These map straightforwardly to one CF per query.
>
> > - list all the items a user liked
>
> row key is user id, columns names are timeuuid of when the like-ing
> occurred, column value is either item id, or a supercolumn containing
> the denormalized item data
>
> > - list all the users that liked an item
>
> row key is item id, column names are same timeuuids, values are either
> user id or again denormalized
>
> > - list all users and count how many items each user liked
> > (we need this every few hours and in fact we are only interested in
> > the top N users that liked most stuff)
>
> row key is something you hardcode ("topusers"), column names are Long
> values of how many liked, column value is user id or denormalized user
> data
>
> If you just need it every few hours, run a map/reduce job (Hadoop
> integration in 0.6) to compute this that often. Otherwise you will
> have to update it on each insert for each user which is probably a bad
> idea if you have millions of users (all that activity will go to just
> the machines replicating that row). And if you have tens of millions
> of users you will almost certainly run into the
> row-must-fit-in-memory-during-compaction limitation that we're
> removing in 0.7.
>
> -Jonathan
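For comparison, the periodic recomputation suggested in the quoted message avoids the column-name question altogether, because counts are keyed by user. The sketch below is plain in-memory Python standing in for a map/reduce job, not Hadoop or Cassandra code; `top_likers` and the sample data are hypothetical.

```python
from collections import Counter

def top_likers(likes, n):
    """Count likes per user and return the n users with the most likes.

    likes: iterable of (user_id, item_id) pairs.
    Returns a list of (user_id, like_count), highest count first.
    """
    counts = Counter(user_id for user_id, _item_id in likes)
    return counts.most_common(n)

# Made-up sample data: alice likes 3 items, bob 2, carol 1.
likes = [
    ("alice", "item1"), ("bob", "item1"), ("alice", "item2"),
    ("carol", "item3"), ("bob", "item2"), ("alice", "item3"),
]
top2 = top_likers(likes, 2)
# top2 is [("alice", 3), ("bob", 2)]
```

Since the intermediate result maps user id to count, two users with the same count never collide; the sort by count only happens at the end.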