My company is starting to use Riak for document storage. I'm pretty happy about how it has been working so far, but I see the messages of foreboding and doom out there about Riak Search and I've encountered a problem myself.
I can't really avoid using Riak Search, as full-text indexing is a key feature we need to provide. If Riak Search is suboptimal, so is basically every other text index out there. We've just been burned by ElasticSearch's ineffective load balancing (who would have guessed, consistent hashing is kind of important).

I know that performing searches in Riak Search that return many thousands of documents is discouraged for performance reasons, and the developers encourage removing stopwords to help with this. I have also seen that there is a hard limit on the number of documents a search query can examine: if any term matches more than 100,000 documents, the query returns a too_many_results error (and, incidentally, things get so confused that, in the Python client, the *next* query also fails with an HTTP error 400).

The question is, what should I actually do to avoid this case? I've already removed the usual stopwords, but any particular set of documents might have its own stopwords. For example, in a database of millions of hotel reviews, the word 'hotel' could easily appear in more than 100,000 documents. If we need to search for '5-star hotel', it's wasteful and probably crash-prone to retrieve all the 'hotel' results. What I'd really like to do is just search for '5-star', which, because of IDF scoring, will have about the same effect. That requires knowing somehow that the word 'hotel' appears in too many documents.

Is there a way to determine, via Riak, which terms are overused so I can remove them from search queries? Or do I need to keep track of this entirely on the client end so I can avoid searching for those terms?

Thanks,
-- Rob Speer
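P.S. To make the client-side option concrete, here is a minimal sketch of what I have in mind. It assumes documents are tokenized on whitespace and lowercased, which may not match the analyzer Riak Search actually uses, and it takes the 100,000-document limit mentioned above as the cutoff. `DocFrequencyTracker` and its methods are hypothetical names of my own, not part of Riak or its Python client.

```python
from collections import Counter

# Cutoff taken from the too_many_results behaviour described above.
MAX_DOCS_PER_TERM = 100_000


class DocFrequencyTracker:
    """Hypothetical client-side record of how many documents contain each term."""

    def __init__(self, limit=MAX_DOCS_PER_TERM):
        self.limit = limit
        self.doc_freq = Counter()

    def add_document(self, text):
        # Count each term once per document: this is document frequency,
        # not term frequency, so duplicates within a document are collapsed.
        self.doc_freq.update(set(text.lower().split()))

    def is_overused(self, term):
        # A term is overused once it appears in more documents than the limit.
        return self.doc_freq[term.lower()] > self.limit

    def filter_query(self, terms):
        # Drop overused terms before the query is sent to the search index.
        return [t for t in terms if not self.is_overused(t)]


# Toy example with the limit lowered to 2 so the effect is visible.
tracker = DocFrequencyTracker(limit=2)
for review in ["great 5-star hotel", "bad hotel", "noisy hotel room"]:
    tracker.add_document(review)

print(tracker.filter_query(["5-star", "hotel"]))  # prints ['5-star']
```

The obvious drawback is that these counts have to be kept in sync with the index whenever documents are updated or deleted, which is why I'd prefer to get this information from Riak itself if that is possible.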
_______________________________________________
riak-users mailing list
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
