[ https://issues.apache.org/jira/browse/LUCENE-5211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13767145#comment-13767145 ]
Hoss Man commented on LUCENE-5211:
----------------------------------

The StopFilterFactory supports two different "formats" of stop word files. The default format, which has been supported since day #1, allows comments using "#". More recently, support was added for the "snowball" stopword format, which is what is used in the stopwords_fr.txt file you seem to be referring to.

The example usage of stopwords_fr.txt in Solr explicitly configures the StopFilterFactory so that it knows the file is in the "snowball" format...

{noformat}
<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_fr.txt" format="snowball" />
{noformat}

So there doesn't seem to be any functional bug here -- just a documentation issue: when support was added for the "snowball" format, it appears that nothing was added to the class javadocs of the factory to make this clear.

If no one beats me to it, I'll clean this up next week.

> StopFilterFactory does not honor comments
> -----------------------------------------
>
>                 Key: LUCENE-5211
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5211
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: core/search
>    Affects Versions: 4.2
>            Reporter: Hayden Muhl
>
> The StopFilterFactory builds a CharArraySet directly from the raw lines of
> the supplied words file. This causes a problem when using the stop word files
> supplied with the Solr/Lucene distribution. In particular, the comments in
> those files get added to the CharArraySet. A line like this...
> ceci | this
> Should result in the string "ceci" being added to the CharArraySet, but "ceci
> | this" is what actually gets added.
> Workaround: Remove all comments from stop word files you are using.
> Suggested fix: The StopFilterFactory should strip any comments, then strip
> trailing whitespace. The stop word files supplied with the distribution
> should be edited to conform to the supported comment format.

--
This message is automatically generated by JIRA.
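To illustrate why the reporter saw "ceci | this" end up in the set, here is a minimal sketch of the two comment conventions described in the comment above. It is plain JDK code written for illustration only, not Lucene's actual implementation; the class name StopWordFormats and the methods parseDefault and parseSnowball are hypothetical names. The assumption is that the default format skips lines starting with "#" and takes every other line as one entry, while the snowball format treats everything from "|" to the end of the line as a comment.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration class; not part of Lucene or Solr.
public class StopWordFormats {

    // Default-style parsing (assumed behavior): a line starting with '#'
    // is a comment; any other non-blank line is one entry, trimmed.
    static Set<String> parseDefault(List<String> lines) {
        Set<String> words = new LinkedHashSet<>();
        for (String line : lines) {
            if (line.startsWith("#")) {
                continue;
            }
            String word = line.trim();
            if (!word.isEmpty()) {
                words.add(word);
            }
        }
        return words;
    }

    // Snowball-style parsing (assumed behavior): text from '|' to the end
    // of the line is a comment; what remains is split on whitespace.
    static Set<String> parseSnowball(List<String> lines) {
        Set<String> words = new LinkedHashSet<>();
        for (String line : lines) {
            int bar = line.indexOf('|');
            String content = (bar >= 0 ? line.substring(0, bar) : line).trim();
            if (content.isEmpty()) {
                continue;
            }
            words.addAll(Arrays.asList(content.split("\\s+")));
        }
        return words;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(" | French stop words", "ceci | this", "cela | that");
        // The '|' comments survive because only '#' is recognized here.
        System.out.println(parseDefault(lines));  // [| French stop words, ceci | this, cela | that]
        // The '|' comments are stripped, leaving only the words.
        System.out.println(parseSnowball(lines)); // [ceci, cela]
    }
}
```

Reading a snowball-format file with the default rules reproduces the symptom in the report, which is why the Solr example sets format="snowball" for stopwords_fr.txt.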