Our team has a fork of the riaksearch source that we believe adds a very simple 
but highly desirable improvement.  We've reached out to Basho directly with the 
idea but, as they say, code talks.  The fork can be found at 
https://github.com/gpascale/riak_search and we will be issuing a pull request 
for it today.  We also have a single-line fork of riak-js, located at 
https://github.com/gpascale/riak-js.

In our use case, we need to access documents from a search index in time-based 
order (not relevance order), and we need to paginate over these results.  As 
some of you know, bug 867 - https://issues.basho.com/show_bug.cgi?id=867 - 
prevents sort, start, and rows from being applied correctly.  The currently 
blessed workaround is to write an M/R job to handle it.  We found, however, 
that the performance cost of running an M/R job for every query was 
problematic.

For the very common scenario where you need to jump to a slice of results 
ordered by something besides relevance, we are seeing a 10-100x performance 
gain per query with our fork.  We believe the technique we've implemented 
applies to several other scenarios as well.  For example:

1. You need a bucket to act like an enormous container or queue.  With our fork 
you can quickly ask for the most (or least) recent N items, process them, then 
remove them.

2. Your data is hierarchical (like employees in a company) and you need search 
results to be ordered by seniority within the hierarchy. 

3. You are building a Twitter clone and you want tweet searches to return the 
most recent tweets first.

4. Your data has a "natural" order, such as price or amount, that is always 
preferred to TF-IDF-based relevance.

If you have one of these scenarios, I would encourage you to read further.  In 
the rest of this note I'll (A) explain why bug 867 is a showstopper for you, 
(B) describe a representation hack that you can use to speed up your M/R calls 
if you want to work around the bug with the current builds of riaksearch, and 
(C) show you how to use our fork to see the most significant performance gains.

(A). Bug 867

Suppose you have a query that will normally return 300 results.  You want the 
results ordered by a key within your documents, and you want the middle 100 
results after the sort has been applied.  The semantics of the SOLR API for 
riaksearch are such that start=100&rows=100&sort=KEY is supposed to do the 
job.

Bug 867 prevents this from working because riaksearch performs the slice and 
the sort in the wrong order.  You still get 100 results, but they are the 100 
from the middle of the relevance-sorted results, reordered by KEY only 
afterwards.  In the example above, you get relevance positions 100-199 sorted 
by KEY, rather than KEY positions 100-199.

Fixing this bug is non-trivial because the most straightforward implementation 
requires one to retrieve all of the documents in order to perform the sort by 
KEY, whereas the relevance sort needs only the information in the index.

The workaround for this issue is to do something like the following (shown in 
riak-js):

> // Seed an M/R job with the search results, pull every matching
> // document, sort by a field, then slice out the page we want.
> db.addSearch(bucketName, query)
>     .map('Riak.mapValuesJson')
>     .reduce('Riak.reduceSort', 'function(a, b) { return a.key - b.key; }')
>     .reduce('Riak.reduceSlice', [start, start + count])
>     .run(callback);

Here, you are seeding an M/R job with the search results, pulling out the 
records for all results, doing the sort as a reduce with your own comparison 
function, and then doing the slice.

For a query that returns a few thousand results, you can expect such an M/R 
job to take a couple of seconds on typical hardware, which is hardly 
acceptable for interactive applications.


(B). A representation hack

The M/R job above suffers because it has to retrieve the documents for all 
matching results, not just the ones that are part of the final sliced result 
set.

We can improve upon this by putting our sort key at the front of the document 
ID as a prefix.  Care must be taken to pad the keys properly so that they are 
of fixed length, and you'll want your synthetic document keys to contain 
something unique as well.  We set our document IDs to be the timestamp of 
object creation (properly padded) plus a GUID; thus each document ID is 
unique, but sorts over keys do what we want.  (A sketch of such key generation 
appears at the end of this section.)  With this change, we can rewrite the 
above M/R to:

> db.addSearch(bucketName, query)
>     .map('function (v) { return [v.key]; }')  // emit only each object's key
>     .reduce('Riak.reduceSort',
>             // string comparison, since these keys are strings
>             'function(a, b) { return a < b ? -1 : a > b ? 1 : 0; }')
>     .reduce('Riak.reduceSlice', [start, start + count])
>     .run(callback);


The difference is that our new map() does not pull back the documents 
themselves.  Instead, this M/R completes with a list of the document keys that 
are in the final result (and we may still need to fetch those documents for 
our application).
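
As a simple illustration, once this M/R returns just the keys for the final 
page, you can fetch only those documents (here, keys is the array handed to 
the callback above, and handleError and render are placeholders):

> // Fetch only the documents that made it into the final slice.
> keys.forEach(function (key) {
>     db.get(bucketName, key, function (err, doc) {
>         if (err) return handleError(err);
>         render(doc);
>     });
> });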

We typically see 3-10x performance gains over the previous method, though this 
is highly dependent on your typical result-set and document sizes.
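
For reference, here is a sketch of how such synthetic document IDs can be 
generated (the 13-digit padding width and the Math.random()-based suffix are 
illustrative; a proper GUID generator is preferable):

> // Build a document ID that sorts lexicographically by creation time.
> function makeDocumentId(createdAt) {
>     // Zero-pad the millisecond timestamp to a fixed width so that
>     // string order matches numeric order.
>     var ts = String(createdAt.getTime());
>     while (ts.length < 13) ts = '0' + ts;
>     // Append something unique to avoid key collisions.
>     var suffix = Math.random().toString(16).slice(2);
>     return ts + '-' + suffix;
> }
>
> db.save(bucketName, makeDocumentId(new Date()), doc, callback);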


(C). Our fork of riaksearch

The change we've introduced today takes the idea behind these synthetic keys, 
but puts all of the work back into the search core, thereby eliminating the M/R.

Where the search core performs a sort by relevance (prior to doing the slice 
specified by the SOLR arguments), we replace this with a sort by key.  That's 
the entire essence of the change.

By avoiding the M/R in its entirety, we now often see performance gains of 
3-10x over the previous method and, therefore, 10-100x over the most 
simplistic approach described first.

We've also introduced an option called "presort", which can be set to "key" 
or "score".  The former triggers our new behavior; the latter preserves the 
original behavior.  Leaving the option unspecified is the same as setting it 
to "score", so our fork should not introduce any breaking changes.


Anyhow, these have turned out to be essential changes for us, which is why we 
wanted to make them available to others.

Best,
-- GWF

Founder, Clipboard, Inc.
 



