[ https://issues.apache.org/jira/browse/LUCENE-5211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13772174#comment-13772174 ]
ASF subversion and git services commented on LUCENE-5211:
---------------------------------------------------------
Commit 1524809 from [email protected] in branch 'dev/trunk'
[ https://svn.apache.org/r1524809 ]
LUCENE-5211: Better javadocs and error checking of 'format' option in
StopFilterFactory, as well as comments in all snowball-formatted files about
specifying the format option
> StopFilterFactory docs do not advertise/explain the "format" option
> -------------------------------------------------------------------
>
> Key: LUCENE-5211
> URL: https://issues.apache.org/jira/browse/LUCENE-5211
> Project: Lucene - Core
> Issue Type: Bug
> Affects Versions: 4.2
> Reporter: Hayden Muhl
> Assignee: Hoss Man
> Priority: Minor
> Attachments: LUCENE-5211.code.patch,
> LUCENE-5211.stopfilecomments.patch
>
>
> StopFilterFactory supports a "format" option for controlling whether
> "getWordSet" or "getSnowballWordSet" is used to parse the file, but this
> option is not advertised. People can be confused by the example stopword
> files included in the releases (some of which are in the snowball format
> with "|" comments): they try to use them without explicitly specifying
> {{format="snowball"}} and silently get useless stopwords that include the
> "| comments" as literal portions of the stopwords.
> We need to better document the use of "format" and consider updating all of
> the example stopword files we ship that are in the snowball format with a
> note about the need to use {{format="snowball"}} with those files.
> {panel:title=Initial Bug Report}
> The StopFilterFactory builds a CharArraySet directly from the raw lines of
> the supplied words file. This causes a problem when using the stop word files
> supplied with the Solr/Lucene distribution. In particular, the comments in
> those files get added to the CharArraySet. A line like this...
> ceci | this
> Should result in the string "ceci" being added to the CharArraySet, but "ceci
> | this" is what actually gets added.
> Workaround: Remove all comments from stop word files you are using.
> Suggested fix: The StopFilterFactory should strip any comments, then strip
> trailing whitespace. The stop word files supplied with the distribution
> should be edited to conform to the supported comment format.
> {panel}
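
To illustrate the difference described in the issue, here is a minimal
sketch of the two parsing modes. It is a simplified stand-in and not
Lucene's actual code: the class StopwordFormats and the methods parsePlain
and parseSnowball are hypothetical names, and the rules they apply (one
entry per line with whole-line "#" comments for the plain format, "|"
comments and whitespace-separated words for the snowball format) are
assumptions based on the description above.

```java
import java.util.LinkedHashSet;
import java.util.Set;

/** Simplified stand-in for the two stopword file formats; not Lucene's actual code. */
final class StopwordFormats {

    /** Plain format (assumed): one entry per line, whole-line '#' comments, entries trimmed. */
    static Set<String> parsePlain(String contents) {
        Set<String> words = new LinkedHashSet<>();
        for (String line : contents.split("\\R")) {
            String entry = line.trim();
            if (!entry.isEmpty() && !entry.startsWith("#")) {
                words.add(entry); // a trailing "| comment" stays part of the entry
            }
        }
        return words;
    }

    /** Snowball format (assumed): '|' starts a comment; the rest is split on whitespace. */
    static Set<String> parseSnowball(String contents) {
        Set<String> words = new LinkedHashSet<>();
        for (String line : contents.split("\\R")) {
            int bar = line.indexOf('|');
            String text = (bar >= 0 ? line.substring(0, bar) : line).trim();
            if (text.isEmpty()) {
                continue; // blank line or comment-only line
            }
            for (String word : text.split("\\s+")) {
                words.add(word);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        String file = " | French stop words\nceci | this\ncela | that\n";
        System.out.println(parsePlain(file));    // [| French stop words, ceci | this, cela | that]
        System.out.println(parseSnowball(file)); // [ceci, cela]
    }
}
```

Parsing a snowball-formatted file with the plain rules yields entries such
as "ceci | this", which never match a real token. That is the silent
failure reported in the issue.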