Hi Scott,

Until Riak gains the ability to constrain list traversals by bucket, this will 
continue to be a point of friction. The issue has been broached before and 
there are tickets open on the issue tracker. As I understand it, one potential 
solution would modify bitcask to open a 'cask' per bucket. Nothing comes for 
free, though: that would cost file descriptors at the OS level, thereby 
introducing a constraint on the number of buckets in a cluster. This is 
similar to how the inno backend currently operates, as Sean pointed out.

Recognizing this constraint, how you mitigate it really depends on your use 
case. I hate to sound like a broken record, but recent improvements to key 
traversal notwithstanding, I have been using redis as an intermediary key list 
manager. Augmenting that further, I pull key lists out of redis and write them 
to riak, either by cron or explicitly on user action. Admittedly my volume is 
not at a level where this is a considerable problem at the moment. Then again, 
I don't think it ever will be (for my use case); I'm not trying to crawl the 
world or build the next twitter or facebook. That said, what I do is pull the 
key list out of redis (or riak), generate an appropriate inputs array, and 
feed that to the mapreduce function. I should note that at the moment I do 
this in javascript for ease of development. Another big win of redis over riak 
for list management is that redis understands certain data primitives, whereas 
riak is data agnostic. Practically, this means you can push/pull/pop/slice 
data in redis (among other things); you simply cannot do that in riak. In 
riak, data must be written atomically: if you have a meg, you write a meg. 
There are no diff updates in riak.
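To make that contrast concrete, here is a toy sketch in plain Python (no real 
redis or riak client involved; the stores and helper functions are 
illustrative stand-ins, not library APIs). Appending one member costs a single 
server-side push in redis, but a full read-modify-write of the whole value in 
riak:

```python
import json

# Toy stands-ins: redis understands lists natively; riak sees opaque blobs.
redis_store = {"zip:90210:users": ["u1", "u2"]}
riak_store = {"zip:90210:users": json.dumps(["u1", "u2"]).encode()}

def redis_append(key, member):
    # redis RPUSH semantics: the server mutates the list in place;
    # the client only ships the new member over the wire.
    redis_store.setdefault(key, []).append(member)

def riak_append(key, member):
    # riak semantics: fetch the whole value, decode it, modify it,
    # re-serialize it, and write the whole value back.
    members = json.loads(riak_store.get(key, b"[]"))
    members.append(member)
    riak_store[key] = json.dumps(members).encode()  # have a meg, write a meg

redis_append("zip:90210:users", "u3")
riak_append("zip:90210:users", "u3")
```

Both stores end up with the same list, but the riak path re-wrote the entire 
value to add one member, which is exactly why diff-style updates aren't on the 
table there.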

Performance-wise, the first thing to look at, if and when optimization becomes 
a concern, is moving from the HTTP interface to the protobuf interface. After 
that, I would look into rewriting your mapreduce in erlang. Marshaling complex 
data between the native erlang internals and the javascript interpreter has a 
non-zero cost, and forgoing that step is a big win. Again, I view all of this 
as a growth path within the riak environment and "a good thing" (tm). 

Assuming your most populous zip codes have on the order of ~200k subscribers, 
you could encode your user keys in base 62 and fit those keys in a 
3-character space; move up to 4 characters for far more legroom. At 3 
characters (standard 8-bit encoding) your ~200k key list is under 1 MB, which 
is something to consider given how riak allocates RAM for this portion of the 
mapreduce in erlang and/or js. Also, I'm a big fan of fixed-length keys for 
unrolled loops. Either way, feeding keys explicitly to a mapreduce only gets 
better as your input list shrinks relative to the total keys in your system. 
Data modeling wise, I would have a users bucket, a zip_codes bucket, a 
zip_users bucket, and the converse, a users_zip bucket, the latter two having 
the keys of the former as members. I'm also a big fan of explicitly derived 
keys/paths. I would not recommend links here, simply because of the unbounded, 
potentially large nature of your problem.
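For what it's worth, here is a minimal sketch of the fixed-width base-62 idea 
in plain Python. The alphabet ordering and the derived key path at the end are 
my own assumptions for illustration, not anything riak prescribes:

```python
import string

# 0-9, A-Z, a-z: 62 symbols. 62**3 = 238,328 ids fit in 3 characters,
# comfortably above ~200k subscribers per zip code.
ALPHABET = string.digits + string.ascii_uppercase + string.ascii_lowercase

def encode62(n, width=3):
    """Encode a non-negative integer as a fixed-width base-62 string."""
    s = ""
    while n:
        n, r = divmod(n, 62)
        s = ALPHABET[r] + s
    return (s or "0").rjust(width, "0")  # zero-pad to keep keys fixed length

def decode62(s):
    """Invert encode62."""
    n = 0
    for ch in s:
        n = n * 62 + ALPHABET.index(ch)
    return n

# Explicitly derived key/path (illustrative scheme):
# zip_users/<zip code>/<user id in fixed-width base 62>
key = "zip_users/90210/" + encode62(199999)
```

Since every key is exactly `width` characters, the total key list size is 
trivially predictable, and fixed-length keys keep any unrolled processing 
loops uniform.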

Do keep us posted,

Alexander


On Sep 16, 2010, at 2:49 PM, Scott wrote:

> Thanks for the quick replies Sean and Alexander.  One of our current products 
> allows users to sign up for weather alerts based on their zip code.  When we 
> receive a weather alert for a set of locations, we need to quickly find all 
> users in the zip codes affected. We currently do this with a simple sql query 
> against a relational db.  Being new at this key/value store thing, we are not 
> sure the best way to tackle this with Riak.
> 
> Some zip codes have over 20,000 users, so storing the users in a json array 
> with the zip code as the key would get ugly fast.  One thought was to store 
> the user profiles in one bucket, and then add a key per user in the correct 
> zip code bucket, perhaps with a link back to the users record in the profile 
> bucket.  We could then fetch the keys for the affected zip codes using map 
> reduce.  I am open to all suggestions on how to best model this type of data 
> in Riak.
> 
> Thanks,
> Scott
> 
> 
> Sean Cribbs wrote:
>> Scott,
>> 
>> There is no limit on the number of buckets unless you are changing the 
>> bucket properties, like the replication factor, allow_mult, or the pre- and 
>> post-commit hooks.  Buckets that have properties other than the defaults 
>> consume space in the ring state.  Other than that, they are essentially free 
>> unless you're using a backend that segregates data by bucket - the only one 
>> that does at this time is innostore.
>> 
>> Is there a reason you need so many buckets? 
>> 
>> Sean Cribbs <[email protected]>
>> Developer Advocate
>> Basho Technologies, Inc.
>> http://basho.com/
>> 
>> On Sep 16, 2010, at 2:17 PM, SKester wrote:
>> 
>>> Is there a practical (or hard) limit to the number of buckets a riak 
>>> cluster can handle?  One possible data model we could use for one 
>>> application could result in ~80,000 buckets.  Is that a reasonable number?
>>> 
>>> Thanks,
>>> Scott
>>> _______________________________________________
>>> riak-users mailing list
>>> [email protected]
>>> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
>> 

