First, check out Riak Search. It can sort your records by date, and you can use it for filtering on secondary indexes. If your criteria for listing records are simple and can be expressed as a search query, that may be your best bet. For instance, listing a user's blog entries is as simple as "author_id:some-uuid-or-something".
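For illustration, here's roughly what that looks like from Python, assuming Riak Search's Solr-compatible HTTP interface is enabled for a hypothetical "entries" bucket whose documents carry author_id and posted_at fields (the bucket name, field names, and the sort parameter here are my assumptions, not something from Gary's setup):

    import json
    import urllib
    import urllib2

    RIAK_URL = "http://127.0.0.1:8098"  # assumes Riak's default HTTP port

    def entries_by_author(author_id, start=0, count=20):
        """One page of an author's entries via Riak Search's
        Solr-style /solr/<index>/select endpoint."""
        params = urllib.urlencode({
            "q": "author_id:%s" % author_id,  # the query from above
            "sort": "posted_at",  # assumes a hypothetical, sortable date field
            "start": start,       # paging offset
            "rows": count,        # page size
            "wt": "json",         # ask for JSON instead of the default XML
        })
        url = "%s/solr/entries/select?%s" % (RIAK_URL, params)
        return json.loads(urllib2.urlopen(url).read())

If your dates are stored as something that sorts lexicographically (ISO 8601 strings, or zero-padded epoch seconds), sorting and range queries on them should behave.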
Another approach, which I have had to take, is to keep a single index document that is a list of keys in the order I want them in. I then use a map/reduce query to load, parse, and slice that document in the map phase, returning a list of keys to my application (a rough sketch of this is below). This is fast because it all happens on the node where the data is stored, so I don't have to worry about the network latency of transferring what could be a giant document. Since the keys are already pre-sorted, I don't even need a reduce phase. And since all I need for a web page is 20 entries, I can return just those 20 keys to my application and load the records in roughly 20 ms. That's fast enough for me.

The two things that make map/reduce queries slow are, first, listing the keys in a bucket and, second, loading the records. If you have to load 1,000 records to answer your query, it's probably going to be slow: it may take only 1 ms to fetch and parse one record, but 1 ms * 1,000 is 1 second.

Now, I'm working with a small set of entries that will probably never exceed a million keys, so I can store them all in one document. If you need an infinitely scalable index, you will probably have to do something clever in your application: when the index document reaches a certain size, split it, archive the old entries, and use a "next" link to chain the two documents together. I've also seen people mention using B-trees for sorted data, so that's another possibility.

Another thing to be cautious about is concurrency when writing the index document. My index document is keyed on user_id and is updated when the user saves an entry. It's written at user speed, so I'm not too worried about concurrency; depending on your application, you may need to be. An alternative to writing the index on save is to update it asynchronously via RabbitMQ or something along those lines. It all depends on the needs of your application.

Take all this with a grain of salt; I'm still a bit of a novice when it comes to modeling data in Riak. Figuring out how to index the data so that it facilitates easy lookups was the hardest thing to get my lazy relational brain used to. I hope the Basho folks will correct any glaring faults in what I've said. As with everything, test it for yourself.
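To make the index-document map/reduce concrete, here is a minimal sketch against Riak's HTTP /mapred interface. The bucket name entry_index is a placeholder, and it assumes the index document's value is a JSON array of entry keys, newest first:

    import json
    import urllib2

    RIAK_URL = "http://127.0.0.1:8098"  # assumes Riak's default HTTP port

    def page_of_keys(user_id, start=0, count=20):
        """Slice one page out of the pre-sorted index document inside
        the map phase, so the full index never crosses the network."""
        job = {
            # The input is the single index document -- no bucket key listing.
            "inputs": [["entry_index", user_id]],
            "query": [{
                "map": {
                    "language": "javascript",
                    "arg": {"start": start, "count": count},
                    "source": """
                        function(v, keyData, arg) {
                            // Index doc value: a JSON array of keys, newest first.
                            var keys = JSON.parse(v.values[0].data);
                            return keys.slice(arg.start, arg.start + arg.count);
                        }
                    """
                }
            }]
        }
        req = urllib2.Request("%s/mapred" % RIAK_URL, json.dumps(job),
                              {"Content-Type": "application/json"})
        return json.loads(urllib2.urlopen(req).read())

Because the slice happens on the node that holds the document, only the 20 requested keys come back over the wire; the application then fetches those 20 records individually.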
Resources: http://facility9.com/2010/12/16/secondary-indexes-how-would-you-do-it

Eric.

On Sat, Jan 15, 2011 at 7:23 PM, Gary William Flake <g...@flake.org> wrote:
> I am building a backend for a web service and Riak looks to be a strong fit
> for my needs at this point. However, there is one really simple requirement
> that I can't figure out how to implement on Riak with any sort of
> efficiency. To simplify the question, suppose that I want a twitter-like
> service that has billions of smallish documents that enter into the backend
> over a period of time. The typical read patterns will be:
> 1. List the last N documents.
> 2. Given a document, see the N that came before it.
> 3. From a random point, see the N documents before it.
> Hence, what I really want is to be able to access the documents in sorted
> order by timestamp. I know that I can do all of this with a map/reduce job,
> but that solution literally requires the backend to scan each and every
> document, which seems less than ideal.
> Is there a generally accepted "right" way to do this on Riak? Or is there
> no known way to do this gracefully, and I need to be prepared to roll my
> own secondary indices? Or can this be handled with a Riak Search range
> query?
>
> ###
> The second topic is concerned with finding an elegant way to address the
> first question. One way to solve the problem above is to use a bucket as a
> poor man's list. At the application layer, when I insert a new document, I
> could create a new k:v pair in my "list" bucket with the key being an
> integer corresponding to this document's position in the corpus, and the
> value being the actual key in the document bucket. With this scheme, I
> could satisfy the scenario above with the following observations:
> 1. Once I have the list key for a document, I can find the N before it in
> O(N) time, which is an improvement over the map/reduce solution's
> O(corpus size) / O(cluster size).
> 2. This solution assumes atomicity that doesn't exist in practice (e.g.,
> the operations "new id <- number of docs" and "add a new doc with key new
> id" have to complete as an atomic operation if multiple clients are
> attempting an insert).
> 3. This solution also wants a multiget-style function to map index values
> to true document keys, and then to get the resulting documents.
> I know what you are thinking: just use a serialized JSON array as the value
> of the list that you want, then get it, deserialize it, mutate it, and
> update its value. But this is horrific because each insertion takes
> O(corpus size) time.
> Bottom line: life sucks no matter which route we take.
> What I think is needed are Redis-style objects that support O(1) atomic
> operations. For example, if I simply had a k:v pair where the value was a
> list that supported insert/delete at the front and rear, then we would be
> golden.
> ###
> This brings me to my third question / observation. In scanning over the
> archives of this list, I see that others have requested this feature
> before, but I haven't seen any mention of the feature being considered.
> (Please, please! I hope I am wrong.)
> So, I want to offer what I think is the minimal feature request that
> supports the scenarios I am trying to achieve. Basically, I want to build
> off of links, allowing them to generalize into container types. My
> assumptions:
> 1. Links have a natural order, in that they always come out in the same
> order they were created (at least that's what the Python client API
> states, but I can't confirm it in the Riak docs).
> 2. Links seem to support mutation operations, meaning that I can add a
> link and remove a link in O(1) time (and not in time proportional to the
> number of links on the record).
> If both of these are true, then all that is needed is to add simple O(1)
> mutation operations allowing one to treat a record's links as a special
> container of references to other records. Insert/delete at the front and
> rear will give you stacks and queues. A small addition to the API would
> support sets.
> ###
> That's all for now. I hope someone can chime in with something to add on
> top. My real hope is that I've misunderstood the capabilities of Riak and
> that what I want to do can be done today. But if that's not the case, I am
> curious whether others have the same needs and whether they know how to
> work around them.
> Thanks,
> -- GWF

_______________________________________________
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com