Hi Scott,

Until Riak gains the ability to constrain list traversals by bucket, this will continue to be a point of friction. This issue has been broached before and there are tickets open on the issue tracker. As I understand it, one potential solution would modify bitcask to open a 'cask' per bucket. However, nothing comes for free, and this would come at the expense of file descriptors at the OS level, thereby introducing a constraint on the number of buckets in a cluster. This is similar to how the inno backend currently operates, as Sean pointed out.
Recognizing this constraint, how you mitigate it really depends on your use case. I hate to sound like a broken record, but recent improvements to key traversal notwithstanding, I have been using redis as an intermediary key list manager. Augmenting that further, I pull key lists out of redis and write them to riak, either by cron or explicitly by user action. Admittedly my volume is not at a level where this is a considerable problem at the moment. Then again, I don't think it ever will be (for my use case); I'm not trying to crawl the world or build the next twitter or facebook.

That said, what I do is pull this key list out of redis (or riak), generate an appropriate inputs array, and feed that to the mapreduce function. I should note that at the moment I do this in javascript for ease of development.

The other big win, in my book, of using redis instead of riak for list management is that redis understands certain data primitives whereas riak is data agnostic. What this means practically is that you can push/pull/pop/slice data in redis (among other things). You just cannot do that in riak. Data must be written atomically: if you have a meg, you write a meg. There are no diff updates in riak.

Performance-wise, the first thing you are going to want to look at, if and when optimization becomes a concern, is moving from the HTTP interface to the protobuf interface. After that I would look into rewriting your mapreduce in erlang. Marshaling complex data between the native erlang internals and the javascript interpreter has a non-zero cost associated with it; forgoing that step is a big win. Again, I view all of this as a growth path within the riak environment and "a good thing" (tm).

Assuming your most populous zip codes have on the order of ~200k subscribers, you could encode your user keys in radix 62 (base-62) and fit those keys in a 3-character space (62^3 = 238,328 possible keys). Move up to 4 characters for way more leg room.
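To make the key-list-to-mapreduce step concrete, here is a rough sketch of what I mean by "generate an appropriate inputs array": Riak's /mapred endpoint accepts an explicit inputs list of [bucket, key] pairs, so you just map your key list (however you got it out of redis, e.g. LRANGE on a list) into that shape. The bucket name and keys below are made up for illustration; Riak.mapValuesJson is one of the built-in javascript map functions.

```javascript
// Sketch: turn an explicit key list (e.g. pulled from a redis list
// with LRANGE) into the "inputs" array Riak's MapReduce API accepts.
// Bucket and key names here are hypothetical.
function buildInputs(bucket, keys) {
  return keys.map((k) => [bucket, k]);
}

const job = {
  inputs: buildInputs('zip_users_90210', ['3fA', '0x9', 'Zq1']),
  query: [
    // Built-in javascript map phase that parses each value as JSON.
    { map: { language: 'javascript', name: 'Riak.mapValuesJson' } }
  ]
};

// POST JSON.stringify(job) to /mapred over HTTP (or use the protobuf
// client once performance becomes a concern, as noted above).
```

Because the inputs are explicit, the job never has to walk every key in the cluster, which is the whole point of keeping the key lists outside riak.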
At 3 characters (standard 8-bit encoding, so 3 bytes per key), your ~200k key list is around 600 KB, comfortably under 1 MB (something to consider based on how riak allocates RAM for this portion of the mapreduce in erlang and/or js). Also, I'm a big fan of fixed-length keys for unrolled loops. Either way, feeding keys explicitly to a mapreduce will only get better as your input list shrinks relative to the total keys in your system.

Data-modeling-wise, I would have a users bucket, a zip_codes bucket, a zip_users bucket, and the converse, a users_zip bucket, the latter two having the keys of the former as members. I'm also a big fan of explicitly derived keys/paths. I would not recommend links here, simply because of the unbounded, potentially large nature of your problem.

Do keep us posted,
Alexander

On Sep 16, 2010, at 2:49 PM, Scott wrote:

> Thanks for the quick replies Sean and Alexander. One of our current products
> allows users to sign up for weather alerts based on their zip code. When we
> receive a weather alert for a set of locations, we need to quickly find all
> users in the zip codes affected. We currently do this with a simple sql query
> against a relational db. Being new at this key/value store thing, we are not
> sure the best way to tackle this with Riak.
>
> Some zip codes have over 20,000 users, so storing the users in a json array
> with the zip code as the key would get ugly fast. One thought was to store
> the user profiles in one bucket, and then add a key per user in the correct
> zip code bucket, perhaps with a link back to the user's record in the profile
> bucket. We could then fetch the keys for the affected zip codes using map
> reduce. I am open to all suggestions on how to best model this type of data
> in Riak.
>
> Thanks,
> Scott
>
> Sean Cribbs wrote:
>> Scott,
>>
>> There is no limit on the number of buckets unless you are changing the
>> bucket properties, like the replication factor, allow_mult, or the pre- and
>> post-commit hooks.
>> Buckets that have properties other than the defaults
>> consume space in the ring state. Other than that, they are essentially free
>> unless you're using a backend that segregates data by bucket; the only one
>> that does at this time is innostore.
>>
>> Is there a reason you need so many buckets?
>>
>> Sean Cribbs <[email protected]>
>> Developer Advocate
>> Basho Technologies, Inc.
>> http://basho.com/
>>
>> On Sep 16, 2010, at 2:17 PM, SKester wrote:
>>
>>> Is there a practical (or hard) limit to the number of buckets a riak
>>> cluster can handle? One possible data model we could use for one
>>> application could result in ~80,000 buckets. Is that a reasonable number?
>>>
>>> Thanks,
>>> Scott

_______________________________________________
riak-users mailing list
[email protected]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
