[ https://issues.apache.org/jira/browse/LUCENE-2198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12801446#action_12801446 ]
Simon Willnauer commented on LUCENE-2198:
-----------------------------------------

I kind of agree with both of you. When I started implementing this attribute I had FlagsAttribute in mind, but I didn't choose it because users can pick any bit of the word arbitrarily, which might lead to unexpected behavior.

Another solution I had in mind is to introduce another Attribute (or extend FlagsAttribute) holding a Lucene-private (not the Java visibility keyword) Enum that can be extended in the future. Internally this could use a word or a BitSet (a word will do, I guess) where bits are set according to the enum ordinal. That way we could encode far more than a single boolean, and the cost of adding new "flags" / enum values would be minimal.

{code}
booleanAttribute.isSet(BooleanAttributeEnum.Keyword)
{code}

Something like that. Thoughts?

> support protected words in Stemming TokenFilters
> ------------------------------------------------
>
>                 Key: LUCENE-2198
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2198
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>    Affects Versions: 3.0
>            Reporter: Robert Muir
>            Priority: Minor
>         Attachments: LUCENE-2198.patch, LUCENE-2198.patch
>
>
> This is from LUCENE-1515
> I propose that all stemming TokenFilters have an 'exclusion set' that bypasses any stemming for words in this set.
> Some stemming tokenfilters have this, some do not.
> This would be one way for Karl to implement his new Swedish stemmer (as a text file of ignore words).
> Additionally, it would remove duplication between lucene and solr, as they reimplement snowballfilter since it does not have this functionality.
> Finally, I think this is a pretty common use case, where people want to ignore things like proper nouns in the stemming.
> As an alternative design I considered a case where we generalized this to CharArrayMap (and ignoring words would mean mapping them to themselves), which would also provide a mechanism to override the stemming algorithm. But I think this is too expert, could be its own filter, and the only example of this I can find is in the Dutch stemmer.
> So I think we should just provide ignore with CharArraySet, but if you feel otherwise please comment.
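
The enum-backed flag idea from the comment above could be sketched roughly as follows. This is only an illustration under stated assumptions: `TokenFlag`, `EnumFlags`, and the `PLACEHOLDER` constant are hypothetical names invented here, nothing in the sketch is an existing Lucene class, and a real attribute would still have to plug into Lucene's attribute machinery. It stores one bit per enum constant in a single `long`, indexed by the constant's ordinal, so callers can never touch a bit that has no named meaning.

```java
// Hypothetical sketch: none of these types exist in Lucene.
enum TokenFlag {
    KEYWORD,      // token is protected and should not be stemmed
    PLACEHOLDER   // stand-in for a flag added later (assumption, illustration only)
}

final class EnumFlags {
    // One bit per enum constant, indexed by ordinal(); a long holds up to 64 flags.
    private long bits;

    void set(TokenFlag flag) {
        bits |= mask(flag);
    }

    void clear(TokenFlag flag) {
        bits &= ~mask(flag);
    }

    boolean isSet(TokenFlag flag) {
        return (bits & mask(flag)) != 0;
    }

    private static long mask(TokenFlag flag) {
        return 1L << flag.ordinal();
    }
}

class EnumFlagsDemo {
    public static void main(String[] args) {
        EnumFlags flags = new EnumFlags();
        flags.set(TokenFlag.KEYWORD);
        System.out.println(flags.isSet(TokenFlag.KEYWORD));     // prints "true"
        System.out.println(flags.isSet(TokenFlag.PLACEHOLDER)); // prints "false"
    }
}
```

Adding a flag then means adding one enum constant, which matches the "minimal cost" argument in the comment. The JDK's `java.util.EnumSet` offers the same bit-per-ordinal representation off the shelf, at the price of an extra object per set.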