On 6/30/2020 12:07 PM, Permakoff, Vadim wrote:
Regarding removing the stopwords, I agree, there are many cases when you don't 
want to remove the stopwords, but there is one very compelling case when you 
want them to be removed.

Imagine you have one document with the following text:
1. "to expand the methods for mailing cancellation"
And another document with the text:
2. "to expand methods for mailing cancellation"

The user query is (without quotes): q=expand the methods for mailing 
cancellation
I don't want to return every document that matches with q.op=OR, because that will find 
too many unrelated documents, so I want to search with q.op=AND. Unfortunately, document 2 
will not be found, as it does not contain the stop word "the".
What should I do now?

Do these users want imprecise matches to only show up when there is a well-known stopword involved, or do they also want imprecise matches to show up with ANY word missing, added, or moved? If I were betting on it, I'd say they want the latter, not the former. Erick already gave you the solution to that -- phrase slop.

In modern times, the only valid reason I can think of to implement a stopword filter is for situations where you want it to be impossible to search for certain words -- some might want expletives in this category, for example.

Tuning a Solr config for good results is an exercise in tradeoffs. The core tradeoff in most situations is the standard "precision vs. recall" discussion. A change that increases precision will almost always reduce recall, and vice versa. I know from experience that you'll get more complaints about reducing recall than you will about reducing precision. Implementing a hard-coded phrase slop value of 1 will reduce precision by an amount that's hard to determine, and GREATLY increase recall. Chances are good that most users will appreciate the change. If you make the phrase slop setting configurable by the user, that's even better.
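
As a rough illustration of a configurable phrase slop, here is a minimal Python sketch that builds the request parameters for a sloppy phrase query using the standard "phrase"~N syntax. The function name sloppy_phrase_params and its arguments are made up for this example and are not part of Solr. Which documents a given slop value actually matches depends on your analysis chain, so treat this as a starting point for experiments rather than a finished solution.

```python
from urllib.parse import urlencode


def sloppy_phrase_params(user_text, slop=1, field=None):
    """Build Solr request parameters for a phrase query with slop.

    Hypothetical helper: the name and arguments are illustrative only.
    """
    if slop < 0:
        raise ValueError("slop must be non-negative")
    # Escape backslashes and double quotes so the text stays one phrase.
    escaped = user_text.replace("\\", "\\\\").replace('"', '\\"')
    # Standard query parser syntax for a sloppy phrase: "some words"~N
    phrase = f'"{escaped}"~{slop}'
    if field:
        phrase = f"{field}:{phrase}"
    return {"q": phrase}


if __name__ == "__main__":
    params = sloppy_phrase_params("expand the methods for mailing cancellation")
    print(params["q"])
    # URL-encoded form, ready to append to a /select request.
    print(urlencode(params))
```

Defaulting slop to 1 mirrors the hard-coded value discussed above, while letting the caller pass a different value covers the user-configurable case.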

Thanks,
Shawn
