Re: schema design question
if you want to select stuff out w/ one query, then single CF is the
only sane choice
if not then 2 CFs may be more performant
On Wed, Mar 10, 2010 at 4:42 AM, Matteo Caprari
wrote:
> I can't quite decide if to go with a flat schema, with keys repeated
> in different CFs
> or have one CF with nested supercolumns.
>
> I guess there is no straight answer here, but what's a good reasoning
> about the choice?
>
> This two mutation maps should clarify my dillemma:
>
> deep_mutation_map = {
> 'example_item': {
> 'Items': [
> Mutation(SuperColumn('details', [
> Column('title', 'an article'),
> Column('link', 'www.example.com')
> ])),
> Mutation(SuperColumn('likers', [
> Column('user_1', 'xx'),
> Column('user_2', 'xx')
> ]))
> ]
> }
> }
>
> flat_mutation_map = {
> 'example_item': {
> 'Item_Info': [
> Mutation(Column('title', 'an_article')),
> Mutation(Column('link', 'www.example.com')),
> ],
> 'Item_likers': [
> Mutation(Column('user_1', 'xx')),
> Mutation(Column('user_2', 'xx'))
> ]
> }
> }
>
>
> On Tue, Mar 9, 2010 at 7:33 PM, Jonathan Ellis wrote:
>> On Tue, Mar 9, 2010 at 7:30 AM, Matteo Caprari
>> wrote:
>>> On Tue, Mar 9, 2010 at 1:23 PM, Jonathan Ellis wrote:
That's true. So you'd want to use a custom comparator where first 64
bits is the Long and the rest is the userid, for instance.
(Long + something else is common enough that we might want to add it
to the defaults...)
>>>
>>> What about using a SuperColumn for each like-count and then the list
>>> of users that hit that level?
>>
>> That would also work, it's just a little clunky pulling things out of
>> a nested structure when really you want a flat list. But if you are
>> allergic to Java that is the way to go so you don't have to write a
>> custom AbstractType subclass. :)
>>
>> -Jonathan
>>
>
>
>
> --
> :Matteo Caprari
> [email protected]
>
Re: schema design question
I can't quite decide if to go with a flat schema, with keys repeated
in different CFs
or have one CF with nested supercolumns.
I guess there is no straight answer here, but what's a good reasoning
about the choice?
This two mutation maps should clarify my dillemma:
deep_mutation_map = {
'example_item': {
'Items': [
Mutation(SuperColumn('details', [
Column('title', 'an article'),
Column('link', 'www.example.com')
])),
Mutation(SuperColumn('likers', [
Column('user_1', 'xx'),
Column('user_2', 'xx')
]))
]
}
}
flat_mutation_map = {
'example_item': {
'Item_Info': [
Mutation(Column('title', 'an_article')),
Mutation(Column('link', 'www.example.com')),
],
'Item_likers': [
Mutation(Column('user_1', 'xx')),
Mutation(Column('user_2', 'xx'))
]
}
}
On Tue, Mar 9, 2010 at 7:33 PM, Jonathan Ellis wrote:
> On Tue, Mar 9, 2010 at 7:30 AM, Matteo Caprari
> wrote:
>> On Tue, Mar 9, 2010 at 1:23 PM, Jonathan Ellis wrote:
>>> That's true. So you'd want to use a custom comparator where first 64
>>> bits is the Long and the rest is the userid, for instance.
>>>
>>> (Long + something else is common enough that we might want to add it
>>> to the defaults...)
>>
>> What about using a SuperColumn for each like-count and then the list
>> of users that hit that level?
>
> That would also work, it's just a little clunky pulling things out of
> a nested structure when really you want a flat list. But if you are
> allergic to Java that is the way to go so you don't have to write a
> custom AbstractType subclass. :)
>
> -Jonathan
>
--
:Matteo Caprari
[email protected]
Re: schema design question
Well, I don't like clunky and I'm java friendly. I'll go for the abstract class. Thanks for the help. On Tue, Mar 9, 2010 at 7:33 PM, Jonathan Ellis wrote: > On Tue, Mar 9, 2010 at 7:30 AM, Matteo Caprari > wrote: >> On Tue, Mar 9, 2010 at 1:23 PM, Jonathan Ellis wrote: >>> That's true. So you'd want to use a custom comparator where first 64 >>> bits is the Long and the rest is the userid, for instance. >>> >>> (Long + something else is common enough that we might want to add it >>> to the defaults...) >> >> What about using a SuperColumn for each like-count and then the list >> of users that hit that level? > > That would also work, it's just a little clunky pulling things out of > a nested structure when really you want a flat list. But if you are > allergic to Java that is the way to go so you don't have to write a > custom AbstractType subclass. :) > > -Jonathan > -- :Matteo Caprari [email protected]
Re: schema design question
On Tue, Mar 9, 2010 at 7:30 AM, Matteo Caprari wrote: > On Tue, Mar 9, 2010 at 1:23 PM, Jonathan Ellis wrote: >> That's true. So you'd want to use a custom comparator where first 64 >> bits is the Long and the rest is the userid, for instance. >> >> (Long + something else is common enough that we might want to add it >> to the defaults...) > > What about using a SuperColumn for each like-count and then the list > of users that hit that level? That would also work, it's just a little clunky pulling things out of a nested structure when really you want a flat list. But if you are allergic to Java that is the way to go so you don't have to write a custom AbstractType subclass. :) -Jonathan
Re: schema design question
On Tue, Mar 9, 2010 at 1:23 PM, Jonathan Ellis wrote:
> One quad-core node can handle ~14000 inserts per second so you are in
> good shape.
Well, yeah!
>> instead of 'all users that liked N items'?
>
> That's true. So you'd want to use a custom comparator where first 64
> bits is the Long and the rest is the userid, for instance.
>
> (Long + something else is common enough that we might want to add it
> to the defaults...)
What about using a SuperColumn for each like-count and then the list
of users that hit that level?
hits {
'100': [ user_1, user_2 ],
'33': [ ... ]
}
Very helpful, thanks
--
:Matteo Caprari
[email protected]
Re: schema design question
On Tue, Mar 9, 2010 at 3:53 AM, Matteo Caprari wrote: > Thanks Jonathan. > > Correct if I'm wrong: you are suggesting that each time we receive a new > row (item, [users]) we do 2 operations: > > 1) insert (or merge) this row 'as it is' (item, [users]) > 2) for each user in [users]: insert (user, [item]) > > Each incoming item is liked by 100 users, so it would be 100 db ops per item. > User ids are 20b, so it's about 2k per item sent to the database. Right. > At about 10 items/sec, we are looking at 1k db ops/sec or 20k/sec. > > Can you make a gross estimate of hardware requirements? One quad-core node can handle ~14000 inserts per second so you are in good shape. > We don't know when the like-ing happened: is there something like > incremental column names? You can use insert time, or just use a LexicalUUID. > Or can I user item_id as column name and a null-ish placeolder as value? Or that too. > I share Keith concern: if we use Long as column names, won't we end up > seeing just one user > instead of 'all users that liked N items'? That's true. So you'd want to use a custom comparator where first 64 bits is the Long and the rest is the userid, for instance. (Long + something else is common enough that we might want to add it to the defaults...) -Jonathan
Re: schema design question
Thanks Jonathan.
Correct if I'm wrong: you are suggesting that each time we receive a new
row (item, [users]) we do 2 operations:
1) insert (or merge) this row 'as it is' (item, [users])
2) for each user in [users]: insert (user, [item])
Each incoming item is liked by 100 users, so it would be 100 db ops per item.
User ids are 20b, so it's about 2k per item sent to the database.
At about 10 items/sec, we are looking at 1k db ops/sec or 20k/sec.
Can you make a gross estimate of hardware requirements?
(more inline questions below. sorry)
On Tue, Mar 9, 2010 at 3:33 AM, Jonathan Ellis wrote:
>> - list all the items a user liked
>
> row key is user id, columns names are timeuuid of when the like-ing
> occurred, column value is either item id, or a supercolumn containing
> the denormalized item data
We don't know when the like-ing happened: is there something like
incremental column names?
Or can I user item_id as column name and a null-ish placeolder as value?
>> - list all the users that liked an item
>
> row key is item id, column names are same timeuuids, values are either
> user id or again denormalized
Same problem with timeuuids as above.
>> - list all users and count how many items each user liked
>
> row key is something you hardcode ("topusers"), column names are Long
> values of how many liked, column value is user id or denormalized user
> data
I share Keith concern: if we use Long as column names, won't we end up
seeing just one user
instead of 'all users that liked N items'?
--
:Matteo Caprari
[email protected]
Re: schema design question
jonathan,
wouldn't using Long values as the column names for the 3rd CF cause
potential conflicts if 2 users liked the same # of items? (only saving one
user for any given value)
was thinking about this same problem (sorted lists of top N user activity)
and thought that was a roadblock for that design.
-keith
On Mon, Mar 8, 2010 at 7:33 PM, Jonathan Ellis wrote:
> On Mon, Mar 8, 2010 at 6:18 AM, Matteo Caprari
> wrote:
> > The 'key' queries are:
>
> These map straightforwardly to one CF per query.
>
> > - list all the items a user liked
>
> row key is user id, columns names are timeuuid of when the like-ing
> occurred, column value is either item id, or a supercolumn containing
> the denormalized item data
>
> > - list all the users that liked an item
>
> row key is item id, column names are same timeuuids, values are either
> user id or again denormalized
>
> > - list all users and count how many items each user liked
> > (we need this every few hours and in fact we are only interested in
> > the top N users that liked most stuff)
>
> row key is something you hardcode ("topusers"), column names are Long
> values of how many liked, column value is user id or denormalized user
> data
>
> If you just need it every few hours, run a map/reduce job (Hadoop
> integration in 0.6) to compute this that often. Otherwise you will
> have to update it on each insert for each user which is probably a bad
> idea if you have millions of users (all that activity will go to just
> the machines replicating that row). And if you have tens of millions
> of users you will almost certainly run into the
> row-must-fit-in-memory-during-compaction limitation that we're
> removing in 0.7.
>
> -Jonathan
>
Re: schema design question
On Mon, Mar 8, 2010 at 6:18 AM, Matteo Caprari wrote:
> The 'key' queries are:
These map straightforwardly to one CF per query.
> - list all the items a user liked
row key is user id, columns names are timeuuid of when the like-ing
occurred, column value is either item id, or a supercolumn containing
the denormalized item data
> - list all the users that liked an item
row key is item id, column names are same timeuuids, values are either
user id or again denormalized
> - list all users and count how many items each user liked
> (we need this every few hours and in fact we are only interested in
> the top N users that liked most stuff)
row key is something you hardcode ("topusers"), column names are Long
values of how many liked, column value is user id or denormalized user
data
If you just need it every few hours, run a map/reduce job (Hadoop
integration in 0.6) to compute this that often. Otherwise you will
have to update it on each insert for each user which is probably a bad
idea if you have millions of users (all that activity will go to just
the machines replicating that row). And if you have tens of millions
of users you will almost certainly run into the
row-must-fit-in-memory-during-compaction limitation that we're
removing in 0.7.
-Jonathan
