On Tue, Mar 9, 2010 at 3:53 AM, Matteo Caprari <matteo.capr...@gmail.com> wrote:
> Thanks Jonathan.
>
> Correct me if I'm wrong: you are suggesting that each time we receive a new
> row (item, [users]) we do 2 operations:
>
> 1) insert (or merge) this row 'as it is' (item, [users])
> 2) for each user in [users]: insert (user, [item])
>
> Each incoming item is liked by 100 users, so it would be 100 db ops per item.
> User ids are 20 bytes, so it's about 2 KB per item sent to the database.
Right.

> At about 10 items/sec, we are looking at 1k db ops/sec, or 20 KB/sec.
>
> Can you make a gross estimate of hardware requirements?

One quad-core node can handle ~14000 inserts per second, so you are in
good shape.

> We don't know when the like-ing happened: is there something like
> incremental column names?

You can use insert time, or just use a LexicalUUID.

> Or can I use item_id as column name and a null-ish placeholder as value?

That works too.

> I share Keith's concern: if we use Long as column names, won't we end up
> seeing just one user instead of 'all users that liked N items'?

That's true. So you'd want to use a custom comparator where the first 64
bits are the Long and the rest is the userid, for instance. (Long +
something else is common enough that we might want to add it to the
defaults...)

-Jonathan
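
[The two-step fan-out Matteo describes can be sketched roughly as below.
This is illustrative only: plain dicts stand in for the two column
families, and the names `items_cf`, `users_cf`, and `record_likes` are
made up for the sketch; a real deployment would send these as batched
mutations through a Cassandra client.]

```python
# Sketch of the fan-out writes: one insert of the row as-is, plus one
# inverted-index insert per user. Dicts model the two column families.

items_cf = {}   # item_id -> {user_id: value}  ("who liked this item")
users_cf = {}   # user_id -> {item_id: value}  ("what this user liked")

def record_likes(item_id, user_ids):
    """Insert (item, [users]) as-is, then one (user, item) entry per user."""
    # 1) insert (or merge) the row 'as it is'
    row = items_cf.setdefault(item_id, {})
    for user_id in user_ids:
        row[user_id] = b""      # null-ish placeholder value
    # 2) fan out: one inverted insert per user (~100 db ops per item)
    for user_id in user_ids:
        users_cf.setdefault(user_id, {})[item_id] = b""

record_likes("item42", ["alice", "bob"])
```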
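
[The custom comparator suggested above amounts to packing the timestamp
and the userid into a single column name so that byte-wise comparison
sorts by time first and never collapses two users who liked something at
the same instant. A minimal sketch of the packing, with a hypothetical
`column_name` helper; note this assumes non-negative timestamps, since
raw byte comparison would misorder negative signed longs.]

```python
import struct

def column_name(timestamp_ms, user_id):
    """Pack (Long, userid): big-endian 64-bit time followed by the raw
    userid bytes, so byte-wise ordering is time-major, then by userid."""
    return struct.pack(">q", timestamp_ms) + user_id.encode("utf-8")

a = column_name(1268123580000, "alice")
b = column_name(1268123580000, "bob")    # same instant, different user
c = column_name(1268123580001, "aaron")  # one millisecond later

assert a != b                            # no collision on equal timestamps
assert sorted([c, b, a]) == [a, b, c]    # time-ordered, then by userid
```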