Re: schema design question

2010-03-10 Thread Jonathan Ellis
if you want to select stuff out w/ one query, then single CF is the
only sane choice

if not then 2 CFs may be more performant

On Wed, Mar 10, 2010 at 4:42 AM, Matteo Caprari
 wrote:
> I can't quite decide if to go with a flat schema, with keys repeated
> in different CFs
> or have one CF with nested supercolumns.
>
> I guess there is no straight answer here,  but what's a good reasoning
> about the choice?
>
> This two mutation maps should clarify my dillemma:
>
> deep_mutation_map = {
>        'example_item': {
>                'Items': [
>                        Mutation(SuperColumn('details', [
>                                Column('title', 'an article'),
>                                Column('link', 'www.example.com')
>                        ])),
>                        Mutation(SuperColumn('likers', [
>                                Column('user_1', 'xx'),
>                                Column('user_2', 'xx')
>                        ]))
>                ]
>        }
> }
>
> flat_mutation_map = {
>        'example_item': {
>                'Item_Info': [
>                        Mutation(Column('title', 'an_article')),
>                        Mutation(Column('link', 'www.example.com')),
>                ],
>                'Item_likers': [
>                        Mutation(Column('user_1', 'xx')),
>                        Mutation(Column('user_2', 'xx'))
>                ]
>        }
> }
>
>
> On Tue, Mar 9, 2010 at 7:33 PM, Jonathan Ellis  wrote:
>> On Tue, Mar 9, 2010 at 7:30 AM, Matteo Caprari  
>> wrote:
>>> On Tue, Mar 9, 2010 at 1:23 PM, Jonathan Ellis  wrote:
 That's true.  So you'd want to use a custom comparator where first 64
 bits is the Long and the rest is the userid, for instance.

 (Long + something else is common enough that we might want to add it
 to the defaults...)
>>>
>>> What about using a SuperColumn for each like-count and then the list
>>> of users that hit that level?
>>
>> That would also work, it's just a little clunky pulling things out of
>> a nested structure when really you want a flat list.  But if you are
>> allergic to Java that is the way to go so you don't have to write a
>> custom AbstractType subclass. :)
>>
>> -Jonathan
>>
>
>
>
> --
> :Matteo Caprari
> [email protected]
>


Re: schema design question

2010-03-10 Thread Matteo Caprari
I can't quite decide if to go with a flat schema, with keys repeated
in different CFs
or have one CF with nested supercolumns.

I guess there is no straight answer here,  but what's a good reasoning
about the choice?

This two mutation maps should clarify my dillemma:

deep_mutation_map = {
'example_item': {
'Items': [  

Mutation(SuperColumn('details', [
Column('title', 'an article'),
Column('link', 'www.example.com')
])),


Mutation(SuperColumn('likers', [
Column('user_1', 'xx'),
Column('user_2', 'xx')
]))
]
}
}

flat_mutation_map = {
'example_item': {   
'Item_Info': [
Mutation(Column('title', 'an_article')),
Mutation(Column('link', 'www.example.com')),
],
'Item_likers': [
Mutation(Column('user_1', 'xx')),
Mutation(Column('user_2', 'xx'))
]
}
}


On Tue, Mar 9, 2010 at 7:33 PM, Jonathan Ellis  wrote:
> On Tue, Mar 9, 2010 at 7:30 AM, Matteo Caprari  
> wrote:
>> On Tue, Mar 9, 2010 at 1:23 PM, Jonathan Ellis  wrote:
>>> That's true.  So you'd want to use a custom comparator where first 64
>>> bits is the Long and the rest is the userid, for instance.
>>>
>>> (Long + something else is common enough that we might want to add it
>>> to the defaults...)
>>
>> What about using a SuperColumn for each like-count and then the list
>> of users that hit that level?
>
> That would also work, it's just a little clunky pulling things out of
> a nested structure when really you want a flat list.  But if you are
> allergic to Java that is the way to go so you don't have to write a
> custom AbstractType subclass. :)
>
> -Jonathan
>



-- 
:Matteo Caprari
[email protected]


Re: schema design question

2010-03-10 Thread Matteo Caprari
Well, I don't like clunky and I'm java friendly. I'll go for the abstract class.

Thanks for the help.

On Tue, Mar 9, 2010 at 7:33 PM, Jonathan Ellis  wrote:
> On Tue, Mar 9, 2010 at 7:30 AM, Matteo Caprari  
> wrote:
>> On Tue, Mar 9, 2010 at 1:23 PM, Jonathan Ellis  wrote:
>>> That's true.  So you'd want to use a custom comparator where first 64
>>> bits is the Long and the rest is the userid, for instance.
>>>
>>> (Long + something else is common enough that we might want to add it
>>> to the defaults...)
>>
>> What about using a SuperColumn for each like-count and then the list
>> of users that hit that level?
>
> That would also work, it's just a little clunky pulling things out of
> a nested structure when really you want a flat list.  But if you are
> allergic to Java that is the way to go so you don't have to write a
> custom AbstractType subclass. :)
>
> -Jonathan
>



-- 
:Matteo Caprari
[email protected]


Re: schema design question

2010-03-09 Thread Jonathan Ellis
On Tue, Mar 9, 2010 at 7:30 AM, Matteo Caprari  wrote:
> On Tue, Mar 9, 2010 at 1:23 PM, Jonathan Ellis  wrote:
>> That's true.  So you'd want to use a custom comparator where first 64
>> bits is the Long and the rest is the userid, for instance.
>>
>> (Long + something else is common enough that we might want to add it
>> to the defaults...)
>
> What about using a SuperColumn for each like-count and then the list
> of users that hit that level?

That would also work, it's just a little clunky pulling things out of
a nested structure when really you want a flat list.  But if you are
allergic to Java that is the way to go so you don't have to write a
custom AbstractType subclass. :)

-Jonathan


Re: schema design question

2010-03-09 Thread Matteo Caprari
On Tue, Mar 9, 2010 at 1:23 PM, Jonathan Ellis  wrote:
> One quad-core node can handle ~14000 inserts per second so you are in
> good shape.

Well, yeah!

>> instead of 'all users that liked N items'?
>
> That's true.  So you'd want to use a custom comparator where first 64
> bits is the Long and the rest is the userid, for instance.
>
> (Long + something else is common enough that we might want to add it
> to the defaults...)

What about using a SuperColumn for each like-count and then the list
of users that hit that level?

hits {
 '100': [ user_1, user_2 ],
'33': [ ... ]
}


Very helpful, thanks
--
:Matteo Caprari
[email protected]


Re: schema design question

2010-03-09 Thread Jonathan Ellis
On Tue, Mar 9, 2010 at 3:53 AM, Matteo Caprari  wrote:
> Thanks Jonathan.
>
> Correct if I'm wrong: you are suggesting that each time we receive a new
> row (item, [users]) we do 2 operations:
>
> 1) insert (or merge) this row 'as it is' (item, [users])
> 2) for each user in [users]: insert  (user, [item])
>
> Each incoming item is liked by 100 users, so it would be 100 db ops per item.
> User ids are 20b, so it's about 2k per item sent to the database.

Right.

> At about 10 items/sec, we are looking at 1k db ops/sec or 20k/sec.
>
> Can you make a gross estimate of hardware requirements?

One quad-core node can handle ~14000 inserts per second so you are in
good shape.

> We don't know when the like-ing happened: is there something like
> incremental column names?

You can use insert time, or just use a LexicalUUID.

> Or can I user item_id as column name and a null-ish placeolder as value?

Or that too.

> I share Keith concern: if we use Long as column names, won't we end up
> seeing just one user
> instead of 'all users that liked N items'?

That's true.  So you'd want to use a custom comparator where first 64
bits is the Long and the rest is the userid, for instance.

(Long + something else is common enough that we might want to add it
to the defaults...)

-Jonathan


Re: schema design question

2010-03-09 Thread Matteo Caprari
Thanks Jonathan.

Correct if I'm wrong: you are suggesting that each time we receive a new
row (item, [users]) we do 2 operations:

1) insert (or merge) this row 'as it is' (item, [users])
2) for each user in [users]: insert  (user, [item])

Each incoming item is liked by 100 users, so it would be 100 db ops per item.
User ids are 20b, so it's about 2k per item sent to the database.

At about 10 items/sec, we are looking at 1k db ops/sec or 20k/sec.

Can you make a gross estimate of hardware requirements?

(more inline questions below. sorry)

On Tue, Mar 9, 2010 at 3:33 AM, Jonathan Ellis  wrote:

>> - list all the items a user liked
>
> row key is user id, columns names are timeuuid of when the like-ing
> occurred, column value is either item id, or a supercolumn containing
> the denormalized item data

We don't know when the like-ing happened: is there something like
incremental column names?
Or can I user item_id as column name and a null-ish placeolder as value?

>> - list all the users that liked an item
>
> row key is item id, column names are same timeuuids, values are either
> user id or again denormalized

Same problem with timeuuids as above.

>> - list all users and count how many items each user liked
>
> row key is something you hardcode ("topusers"), column names are Long
> values of how many liked, column value is user id or denormalized user
> data

I share Keith concern: if we use Long as column names, won't we end up
seeing just one user
instead of 'all users that liked N items'?

-- 
:Matteo Caprari
[email protected]


Re: schema design question

2010-03-08 Thread Keith Thornhill
jonathan,

wouldn't using Long values as the column names for the 3rd CF cause
potential conflicts if 2 users liked the same # of items? (only saving one
user for any given value)

was thinking about this same problem (sorted lists of top N user activity)
and thought that was a roadblock for that design.

-keith

On Mon, Mar 8, 2010 at 7:33 PM, Jonathan Ellis  wrote:

> On Mon, Mar 8, 2010 at 6:18 AM, Matteo Caprari 
> wrote:
> > The 'key' queries are:
>
> These map straightforwardly to one CF per query.
>
> > - list all the items a user liked
>
> row key is user id, columns names are timeuuid of when the like-ing
> occurred, column value is either item id, or a supercolumn containing
> the denormalized item data
>
> > - list all the users that liked an item
>
> row key is item id, column names are same timeuuids, values are either
> user id or again denormalized
>
> > - list all users and count how many items each user liked
> > (we need this every few hours and in fact we are only interested in
> > the top N users that liked most stuff)
>
> row key is something you hardcode ("topusers"), column names are Long
> values of how many liked, column value is user id or denormalized user
> data
>
> If you just need it every few hours, run a map/reduce job (Hadoop
> integration in 0.6) to compute this that often.  Otherwise you will
> have to update it on each insert for each user which is probably a bad
> idea if you have millions of users (all that activity will go to just
> the machines replicating that row).  And if you have tens of millions
> of users you will almost certainly run into the
> row-must-fit-in-memory-during-compaction limitation that we're
> removing in 0.7.
>
> -Jonathan
>


Re: schema design question

2010-03-08 Thread Jonathan Ellis
On Mon, Mar 8, 2010 at 6:18 AM, Matteo Caprari  wrote:
> The 'key' queries are:

These map straightforwardly to one CF per query.

> - list all the items a user liked

row key is user id, columns names are timeuuid of when the like-ing
occurred, column value is either item id, or a supercolumn containing
the denormalized item data

> - list all the users that liked an item

row key is item id, column names are same timeuuids, values are either
user id or again denormalized

> - list all users and count how many items each user liked
> (we need this every few hours and in fact we are only interested in
> the top N users that liked most stuff)

row key is something you hardcode ("topusers"), column names are Long
values of how many liked, column value is user id or denormalized user
data

If you just need it every few hours, run a map/reduce job (Hadoop
integration in 0.6) to compute this that often.  Otherwise you will
have to update it on each insert for each user which is probably a bad
idea if you have millions of users (all that activity will go to just
the machines replicating that row).  And if you have tens of millions
of users you will almost certainly run into the
row-must-fit-in-memory-during-compaction limitation that we're
removing in 0.7.

-Jonathan