My company is starting to use Riak for document storage. I'm pretty happy
about how it has been working so far, but I see the messages of foreboding
and doom out there about Riak Search and I've encountered a problem myself.

I can't really avoid using Riak Search, as full text indexing is a key
feature we need to provide. If Riak Search is suboptimal, so is basically
every other text index out there. We've just been burned by ElasticSearch's
ineffective load balancing (who would have guessed, consistent hashing is
kind of important).

I know that searches in Riak Search that return many thousands of documents
are discouraged for performance reasons, and that the developers encourage
removing stopwords to help with this. I have also seen that there is a hard
limit on the number of documents a search query can examine: if any term
matches more than 100,000 documents, the query returns a too_many_results
error (and, incidentally, things get so confused that, in the Python
client, the *next* query also fails with an HTTP 400 error).

The question is, what should I actually do to avoid this case? I've already
removed the usual stopwords, but any particular set of documents might have
its own personal stopwords. For example, in a database of millions of hotel
reviews, the word 'hotel' could easily appear in more than 100,000
documents.

If we need to search for '5-star hotel', it's wasteful and probably
crash-prone to retrieve all the 'hotel' results. What I'd really like to do
is just search for '5-star', which, because of IDF scoring, will have about
the same effect. That requires somehow knowing that the word 'hotel'
appears in too many documents.

Is there a way to determine, via Riak, which terms are overused so I can
remove them from search queries? Or do I need to keep track of this
entirely on the client end so I can avoid searching for those terms?
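
To make the client-side option concrete, here is a rough sketch of what I
have in mind. It is only an illustration, not anything Riak provides: the
TermTracker class, its method names, and the whitespace tokenizer are all
made up for this example, the counts live only in memory, and the cutoff is
simply the 100,000-document limit mentioned above. A real version would
have to tokenize the same way the search schema's analyzer does and handle
document updates and deletes.

```python
from collections import Counter

# Assumed cutoff, taken from the 100,000-document limit described above.
MAX_DOCS_PER_TERM = 100_000


class TermTracker:
    """Hypothetical client-side document-frequency counter (not a Riak API)."""

    def __init__(self, max_docs=MAX_DOCS_PER_TERM):
        self.max_docs = max_docs
        self.doc_freq = Counter()  # term -> number of documents containing it

    def add_document(self, text):
        # Naive whitespace tokenizer (an assumption). Using a set counts
        # each distinct term once per document.
        self.doc_freq.update(set(text.lower().split()))

    def is_overused(self, term):
        # A term is overused once it appears in more than max_docs documents.
        return self.doc_freq[term.lower()] > self.max_docs

    def filter_query(self, terms):
        # Drop overused terms before the query is sent to the search index.
        kept = [t for t in terms if not self.is_overused(t)]
        if not kept and terms:
            # Every term is overused: keep the rarest one instead of
            # sending an empty query.
            kept = [min(terms, key=lambda t: self.doc_freq[t.lower()])]
        return kept
```

The idea would be to call add_document() on every document as it is stored,
then pass each query's terms through filter_query() before searching, so
that ['5-star', 'hotel'] is reduced to ['5-star'] once 'hotel' crosses the
cutoff.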

Thanks,
-- Rob Speer
_______________________________________________
riak-users mailing list
[email protected]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
