On 6/30/2020 12:07 PM, Permakoff, Vadim wrote:
> Regarding removing the stopwords, I agree, there are many cases when you don't
> want to remove the stopwords, but there is one very compelling case when you
> want them to be removed.
> Imagine, you have one document with the following text:
> 1. "to expand the methods for mailing cancellation"
> And another document with the text:
> 2. "to expand methods for mailing cancellation"
> The user query is (without quotes): q=expand the methods for mailing
> cancellation
> I don't want to bring all the documents with condition q.op=OR, it will find too many
> unrelated documents, so I want to search with q.op=AND. Unfortunately, the document 2
> will not be found as it has no stop word "the" in it.
> What should I do now?
Do these users want imprecise matches to only show up when there is a
well-known stopword involved, or do they also want imprecise matches to
show up with ANY word missing, added, or moved? If I were betting on
it, I'd say they want the latter, not the former. Erick already gave
you the solution to that -- phrase slop.
In modern times, the only valid reason I can think of to implement a
stopword filter is for situations where you want it to be impossible to
search for certain words -- some might want expletives in this category,
for example.
Tuning a Solr config for good results is an exercise in tradeoffs. The
core tradeoff in most situations is the standard "precision vs. recall"
discussion. A change that increases precision will almost always reduce
recall, and vice versa. I know from experience that you'll get more
complaints about reducing recall than you will about reducing precision.
Implementing a hard-coded phrase slop value of 1 will reduce precision
by an amount that's hard to determine, and GREATLY increase recall.
Chances are good that most users will appreciate the change. If you
make the phrase slop setting configurable by the user, that's even better.
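To make that concrete, here is a rough sketch of how a client might attach a slop value to the user's text before sending it to Solr. It relies on the standard query parser's "phrase"~N proximity syntax. The field name "text", the MAX_SLOP cap, and the build_params helper are all made up for illustration and are not part of Solr; adjust them to your own schema and client.

```python
from urllib.parse import urlencode

DEFAULT_SLOP = 1  # the hard-coded value suggested above
MAX_SLOP = 10     # hypothetical cap chosen for this sketch, not a Solr limit


def build_params(user_text, slop=DEFAULT_SLOP, field="text"):
    """Return Solr query parameters that search user_text as a sloppy phrase.

    'field' is an assumed field name; replace it with one from your schema.
    """
    # Clamp a user-supplied slop so odd input cannot produce a huge value.
    slop = max(0, min(int(slop), MAX_SLOP))
    # Escape backslashes and double quotes so the text stays inside the phrase.
    phrase = user_text.replace("\\", "\\\\").replace('"', '\\"')
    return {"q": f'{field}:"{phrase}"~{slop}', "q.op": "AND"}


params = build_params("expand the methods for mailing cancellation")
print(params["q"])        # the phrase query with ~1 appended
print(urlencode(params))  # URL-encoded form, ready to append to a /select request
```

Calling build_params(text, slop=user_choice) is one way to expose the setting to the user while keeping 1 as the default. With the edismax parser, the qs and ps parameters could serve a similar purpose without rewriting the query string.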
Thanks,
Shawn