I often recommend against stop word removal altogether. Is there any
reason you need to remove them?
The primary reason stop words get removed is to improve the performance
of queries that contain very common terms. If you are running into that,
Solr's CommonGramsFilter(Factory) is a good way to keep your stop words
while mitigating the potential performance degradation.
The HathiTrust folks have had success with the common grams capability.
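
To make the idea concrete, here is a rough Python sketch of what common
grams do at index time. It is only an illustration, not Solr's actual
code: the function name common_grams, the "_" separator, and the sample
word list are assumptions made for the example. Every original token is
kept, and a combined token is added whenever a token or its neighbour
is a common word, so a phrase query can match the rarer combined token
instead of the very frequent single word.

```python
def common_grams(tokens, common_words, sep="_"):
    """Return the tokens plus bigrams for pairs that involve a common word.

    Illustrative only: 'common_words' stands in for whatever word list
    the filter would be configured with.
    """
    out = []
    for i, tok in enumerate(tokens):
        # Always keep the original token, stop word or not.
        out.append(tok)
        if i + 1 < len(tokens):
            nxt = tokens[i + 1]
            # Add a combined token when either side is a common word.
            if tok in common_words or nxt in common_words:
                out.append(tok + sep + nxt)
    return out


if __name__ == "__main__":
    print(common_grams(["the", "rain", "in", "spain"], {"the", "in"}))
```

Nothing is dropped here, so a lone stop word is still searchable; the
cost is a larger index.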
Erik
On Nov 11, 2009, at 3:41 PM, Eric James wrote:
Has anyone already given some thought to refining the Solr
stopwords.txt for library collections, particularly finding aids?
The words included in the out-of-the-box stopwords.txt are of very
questionable unimportance:
<an and are as at be but by for if in into is it not of on or s such
t that the their then there these they this to was will with>
We were indexing a field id with "no." (for number) as one of its
tokens, and wanted a query for "no" (where the person did not add
the period) to find the doc. In practice, though, the "no" gets
stripped by the StopFilterFactory. That is how we stumbled upon this
list, and we were a bit surprised by some of the inclusions
(e.g. "will") and exclusions (e.g. "a").
Thanks,
Eric James
Yale University Libraries