http://issues.apache.org/bugzilla/show_bug.cgi?id=28960

Add "an" to the English stop words

[EMAIL PROTECTED] changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|RESOLVED                    |REOPENED
         Resolution|FIXED                       |

------- Additional Comments From [EMAIL PROTECTED]  2004-05-20 16:52 -------
This is a can of worms I'm hesitant to open. If we add "an" then we'll be asked to add "its", and if we add "its" we'll be asked to add "do", and so on.

This stop list was originally generated by looking at the most frequent terms in a collection. I guess "an" was less frequent in that collection than "a" or any other word on the list. There are other, better ways to define stop lists, but I don't think the Lucene project should be in the business of providing high-quality stop lists. The Snowball project is a much better place for that sort of activity. If you want a good, big, English stop list, grab:

http://snowball.tartarus.org/english/stop.txt

I think the best long-term fix for this is to extend the Snowball library in the sandbox (http://jakarta.apache.org/lucene/docs/lucene-sandbox/snowball/) so that it provides StopFilters for each of the stop lists provided by Snowball. Once we do this, we can deprecate uses of StopFilter and StopAnalyzer that do not specify a custom stop list. The deprecation documentation can point folks to the Snowball stop filters.

How does that sound? Any volunteers to implement Snowball-based StopFilters?

I think this could just be a static method, something like:

  public static StopFilter getStopFilter(String language);

The implementation could use ClassLoader.getResource() to find a stop list file packaged in the jar file, then parse the file and construct a StopFilter from it. It should probably also cache these, so that every call doesn't re-parse the file.
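For anyone picking this up, here is a rough sketch of the load, parse and cache part of that idea. It is only an illustration under several assumptions, not an existing Lucene or Snowball API: the class name SnowballStopWords, the method forLanguage(), the "<language>/stop.txt" resource layout and the UTF-8 encoding are all made up for the example. It uses ClassLoader.getResourceAsStream(), a close relative of getResource(), and it stops at returning the word set so that it needs nothing beyond the JDK; wrapping the set in a StopFilter would be the remaining step.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical helper for illustration; not part of Lucene or Snowball. */
final class SnowballStopWords {

    // One parsed, read-only word set per language, so each file is read only once.
    private static final Map<String, Set<String>> CACHE = new ConcurrentHashMap<>();

    private SnowballStopWords() {}

    /** Returns the cached stop words for a language, loading them on first use. */
    static Set<String> forLanguage(String language) {
        return CACHE.computeIfAbsent(language, SnowballStopWords::load);
    }

    // Assumed resource layout: "<language>/stop.txt" packaged on the classpath.
    private static Set<String> load(String language) {
        String path = language + "/stop.txt";
        InputStream in = SnowballStopWords.class.getClassLoader().getResourceAsStream(path);
        if (in == null) {
            throw new IllegalArgumentException("No stop list on classpath: " + path);
        }
        // Assumption: the packaged file is UTF-8 encoded.
        try (Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8)) {
            return parse(reader);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Parses the Snowball list format: '|' starts a comment, words are whitespace-separated. */
    static Set<String> parse(Reader source) {
        Set<String> words = new LinkedHashSet<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                int bar = line.indexOf('|');
                String text = (bar >= 0 ? line.substring(0, bar) : line).trim();
                if (text.isEmpty()) {
                    continue; // blank line or comment-only line
                }
                for (String word : text.split("\\s+")) {
                    words.add(word);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return Collections.unmodifiableSet(words);
    }
}
```

A caller would do something like SnowballStopWords.forLanguage("english") and hand the resulting words to a StopFilter along with the token stream to be filtered. Since a StopFilter wraps a particular token stream, caching the parsed word set rather than the filter itself seems like the natural place to avoid re-parsing.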
