Thanks Jonathan.

Correct me if I'm wrong: you are suggesting that each time we receive a new
row (item, [users]) we do two operations:

1) insert (or merge) this row 'as it is': (item, [users])
2) for each user in [users]: insert (user, [item])

Each incoming item is liked by about 100 users, so that's 100 db ops per item.
User ids are 20 bytes, so about 2 KB of ids per item sent to the database.

At about 10 items/sec, we are looking at 1k db ops/sec, or about 20 KB/sec.
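For concreteness, here is a rough sketch of the dual-write pattern I mean, with plain dicts standing in for the two column families (all names are made up for illustration, not actual client API calls):

```python
# Dual-write sketch: one write for the row as received, plus one
# write per user to maintain the inverted (user -> items) index.
item_likes = {}   # item_id -> {user_id: placeholder}
user_likes = {}   # user_id -> {item_id: placeholder}

def record_likes(item_id, user_ids):
    # 1) insert (or merge) the row 'as it is': (item, [users])
    item_likes.setdefault(item_id, {}).update({u: None for u in user_ids})
    # 2) for each user in [users]: insert (user, [item])
    for u in user_ids:
        user_likes.setdefault(u, {})[item_id] = None

record_likes("item1", ["user%d" % i for i in range(100)])
# 100 users -> 1 + 100 writes; at 20 bytes per user id that is
# roughly 100 * 20 = 2000 bytes of ids for this item.
```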

Can you make a gross estimate of hardware requirements?

(more inline questions below. sorry)

On Tue, Mar 9, 2010 at 3:33 AM, Jonathan Ellis <> wrote:

>> - list all the items a user liked
> row key is user id, columns names are timeuuid of when the like-ing
> occurred, column value is either item id, or a supercolumn containing
> the denormalized item data

We don't know when the like-ing happened: is there something like
incremental column names?
Or can I use item_id as the column name and a null-ish placeholder as the value?
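(If "when we received the like" is good enough as the timestamp, I could generate a version-1 UUID at insert time as the column name; v1 UUIDs embed a timestamp, which is what makes them sortable and collision-free as column names. A quick sketch of what I have in mind:)

```python
import time
import uuid

# Version-1 UUIDs embed a 60-bit timestamp, so generating one per
# like event at write time gives unique, time-ordered column names.
cols = []
for _ in range(3):
    cols.append(uuid.uuid1())
    time.sleep(0.001)  # ensure distinct timestamps in this demo

# The embedded timestamps come out in increasing order.
timestamps = [c.time for c in cols]
assert timestamps == sorted(timestamps)
```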

>> - list all the users that liked an item
> row key is item id, column names are same timeuuids, values are either
> user id or again denormalized

Same problem with timeuuids as above.

>> - list all users and count how many items each user liked
> row key is something you hardcode ("topusers"), column names are Long
> values of how many liked, column value is user id or denormalized user
> data

I share Keith's concern: if we use Long values as column names, won't we
end up seeing just one user instead of 'all users that liked N items'?
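(To spell out the concern with a toy example: column names are unique within a row, so two users with the same count map to the same column and the later write wins. Presumably the name would have to be some composite of (count, user id) to avoid this.)

```python
# Toy illustration: using the raw like-count as the column name means
# equal counts overwrite each other within the "topusers" row.
row = {}           # column name (count) -> column value (user id)
row[5] = "alice"   # alice liked 5 items
row[5] = "bob"     # bob also liked 5 items -- overwrites alice

# Only one user survives per count value.
assert row == {5: "bob"}

# A composite name keeps both users visible for the same count.
row2 = {}
row2[(5, "alice")] = "alice"
row2[(5, "bob")] = "bob"
assert len(row2) == 2
```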

:Matteo Caprari
