First, check out Riak Search. It can sort your records by date, and you can use it for filtering on secondary indexes. If your criteria for listing records are simple and can be expressed as a search query, that may be your best bet. For instance, listing a user's blog entries is as simple as "author_id:some-uuid-or-something".
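For illustration, here's roughly what that looks like from Python, assuming Riak Search's Solr-compatible HTTP interface is enabled for a hypothetical "entries" bucket whose documents carry author_id and posted_at fields (the bucket name, field names, and the sort parameter here are my assumptions, not something from Gary's setup):

    import json
    import urllib
    import urllib2

    RIAK_URL = "http://127.0.0.1:8098"  # assumes Riak's default HTTP port

    def entries_by_author(author_id, start=0, count=20):
        """One page of an author's entries via Riak Search's
        Solr-style /solr/<index>/select endpoint."""
        params = urllib.urlencode({
            "q": "author_id:%s" % author_id,  # the query from above
            "sort": "posted_at",  # assumes a hypothetical, sortable date field
            "start": start,       # paging offset
            "rows": count,        # page size
            "wt": "json",         # ask for JSON instead of the default XML
        })
        url = "%s/solr/entries/select?%s" % (RIAK_URL, params)
        return json.loads(urllib2.urlopen(url).read())

If your dates are stored as something that sorts lexicographically (ISO 8601 strings, or zero-padded epoch seconds), sorting and range queries on them should behave.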
Another approach, which I have had to take, is to keep a single index document that is a list of keys in the order I want them in. I then use a map/reduce query to load, parse, and slice that document in the map phase, returning a list of keys to my application (a rough sketch of this is below). This is fast because it all happens on the node where the data is stored, so I don't have to worry about the network latency of transferring what could be a giant document. Since the keys are already pre-sorted, I don't even need a reduce phase. And since all I need for a web page is 20 entries, I can return just those 20 keys to my application and load the records in roughly 20 ms. That's fast enough for me.

The two things that make map/reduce queries slow are, first, listing the keys in a bucket and, second, loading the records. If you have to load 1,000 records to answer your query, it's probably going to be slow: it may take only 1 ms to fetch and parse one record, but 1 ms * 1,000 is 1 second.

Now, I'm working with a small set of entries that will probably never exceed a million keys, so I can store them all in one document. If you need an infinitely scalable index, you will probably have to do something clever in your application: when the index document reaches a certain size, split it, archive the old entries, and use a "next" link to chain the two documents together. I've also seen people mention using B-trees for sorted data, so that's another possibility.

Another thing to be cautious about is concurrency when writing the index document. My index document is keyed on user_id and is updated when the user saves an entry. It's written at user speed, so I'm not too worried about concurrency; depending on your application, you may need to be. An alternative to writing the index on save is to update it asynchronously via RabbitMQ or something along those lines. It all depends on the needs of your application.

Take all this with a grain of salt; I'm still a bit of a novice when it comes to modeling data in Riak. Figuring out how to index the data so that it facilitates easy lookups was the hardest thing to get my lazy relational brain used to. I hope the Basho folks will correct any glaring faults in what I've said. As with everything, test it for yourself.
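To make the index-document map/reduce concrete, here is a minimal sketch against Riak's HTTP /mapred interface. The bucket name entry_index is a placeholder, and it assumes the index document's value is a JSON array of entry keys, newest first:

    import json
    import urllib2

    RIAK_URL = "http://127.0.0.1:8098"  # assumes Riak's default HTTP port

    def page_of_keys(user_id, start=0, count=20):
        """Slice one page out of the pre-sorted index document inside
        the map phase, so the full index never crosses the network."""
        job = {
            # The input is the single index document -- no bucket key listing.
            "inputs": [["entry_index", user_id]],
            "query": [{
                "map": {
                    "language": "javascript",
                    "arg": {"start": start, "count": count},
                    "source": """
                        function(v, keyData, arg) {
                            // Index doc value: a JSON array of keys, newest first.
                            var keys = JSON.parse(v.values[0].data);
                            return keys.slice(arg.start, arg.start + arg.count);
                        }
                    """
                }
            }]
        }
        req = urllib2.Request("%s/mapred" % RIAK_URL, json.dumps(job),
                              {"Content-Type": "application/json"})
        return json.loads(urllib2.urlopen(req).read())

Because the slice happens on the node that holds the document, only the 20 requested keys come back over the wire; the application then fetches those 20 records individually.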
Resources: http://facility9.com/2010/12/16/secondary-indexes-how-would-you-do-it

Eric.

On Sat, Jan 15, 2011 at 7:23 PM, Gary William Flake <g...@flake.org> wrote:
> I am building a backend for a web service and Riak looks to be a strong fit
> for my needs at this point. However, there is one really simple requirement
> that I can't figure out how to implement on Riak with any sort of
> efficiency. To simplify the question, suppose that I want a twitter-like
> service that has billions of smallish documents that enter into the backend
> over a period of time. The typical read patterns will be:
> 1. List the last N documents.
> 2. Given a document, see the N that came before it.
> 3. From a random point, see the N documents before it.
> Hence, what I really want is to be able to access the documents in sorted
> order by timestamp. I know that I can do all of this with a map/reduce job,
> but that solution literally requires the backend to scan each and every
> document, which seems less than ideal.
> Is there a generally accepted "right" way to do this on Riak? Or is there
> no known way to do this gracefully, and I need to be prepared to roll my
> own secondary indices? Or can this be handled with a Riak Search range
> query?
>
> ###
> The second topic is concerned with finding an elegant way to address the
> first question. One way to solve the problem above is to use a bucket as a
> poor man's list. At the application layer, when I insert a new document, I
> could create a new k:v pair in my "list" bucket with the key being an
> integer corresponding to this document's position in the corpus, and the
> value being the actual key in the document bucket. With this scheme, I
> could satisfy the scenario above with the following observations:
> 1. Once I have the list key for a document, I can find the N before it in
> O(N) time, which is an improvement over the map/reduce solution's
> O(corpus size) / O(cluster size).
> 2. This solution assumes atomicity that doesn't exist in practice (e.g.,
> the operations "new id <- number of docs" and "add a new doc with key new
> id" have to complete as an atomic operation if multiple clients are
> attempting an insert).
> 3. This solution also wants a multiget-style function to map index values
> to true document keys, and then to get the resulting documents.
> I know what you are thinking: just use a serialized JSON array as the value
> of the list that you want, then get it, deserialize it, mutate it, and
> update its value. But this is horrific because each insertion takes
> O(corpus size) time.
> Bottom line: life sucks no matter which route we take.
> What I think is needed are Redis-style objects that support O(1) atomic
> operations. For example, if I simply had a k:v pair where the value was a
> list that supported insert/delete at the front and rear, then we would be
> golden.
> ###
> This brings me to my third question / observation. In scanning over the
> archives of this list, I see that others have requested this feature
> before, but I haven't seen any mention of the feature being considered.
> (Please, please! I hope I am wrong.)
> So, I want to offer what I think is the minimal feature request that
> supports the scenarios I am trying to achieve. Basically, I want to build
> off of links, allowing them to generalize into container types. My
> assumptions:
> 1. Links have a natural order, in that they always come out in the same
> order they were created (at least that's what the Python client API
> states, but I can't confirm it in the Riak docs).
> 2. Links seem to support mutation operations, meaning that I can add a
> link and remove a link in O(1) time (and not in time proportional to the
> number of links on the record).
> If both of these are true, then all that is needed is to add simple O(1)
> mutation operations allowing one to treat a record's links as a special
> container of references to other records. Insert/delete at the front and
> rear will give you stacks and queues. A small addition to the API would
> support sets.
> ###
> That's all for now. I hope someone can chime in with something to add on
> top. My real hope is that I've misunderstood the capabilities of Riak and
> that what I want to do can be done today. But if that's not the case, I am
> curious whether others have the same needs and whether they know how to
> work around them.
> Thanks,
> -- GWF

_______________________________________________
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com