[jira] Commented: (LUCENE-2198) support protected words in Stemming TokenFilters

Uwe Schindler (JIRA) Sun, 17 Jan 2010 09:46:17 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-2198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12801448#action_12801448
 ]


Uwe Schindler commented on LUCENE-2198:
---------------------------------------

My problem with FlagsAttribute is the missing "type safety". You have to choose 
a bit mask for "your" attribute, but another TokenFilter in your stream could 
use the same bit mask. So in my opinion, FlagsAttribute should be deprecated and 
replaced by simple boolean attributes that everybody can define type-safely.
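
To illustrate the collision described above, here is a minimal sketch in plain 
Java. It does not use the Lucene API: the shared int only stands in for the 
flags value, and every name in it (KEYWORD_FLAG, ABBREVIATION_FLAG, 
KeywordMarker, AbbreviationMarker) is hypothetical.

```java
class FlagCollisionSketch {
    // Two filters written independently; both happened to pick bit 0.
    static final int KEYWORD_FLAG = 1;      // hypothetical: "do not stem this token"
    static final int ABBREVIATION_FLAG = 1; // hypothetical: chosen by an unrelated filter

    // Stand-in for a filter that sets its bit in the shared int flags value.
    static int markAbbreviation(int flags) {
        return flags | ABBREVIATION_FLAG;
    }

    // Stand-in for a stemmer that checks whether the token is protected.
    static boolean isKeyword(int flags) {
        return (flags & KEYWORD_FLAG) != 0;
    }

    // Typed alternative: each concern gets its own boolean holder,
    // so two filters cannot overwrite each other's state by accident.
    static final class KeywordMarker { boolean keyword; }
    static final class AbbreviationMarker { boolean abbreviation; }

    public static void main(String[] args) {
        int flags = markAbbreviation(0);
        // Nobody marked a keyword, yet the shared bit says otherwise.
        System.out.println(isKeyword(flags)); // prints true

        KeywordMarker kw = new KeywordMarker();
        AbbreviationMarker abbr = new AbbreviationMarker();
        abbr.abbreviation = true;
        System.out.println(kw.keyword); // prints false
    }
}
```

With separate typed holders, the compiler keeps the two concerns apart; with a 
shared bit mask, only convention does.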

About the speed of cloning: clone() was slow in old Java versions, but now it 
is done directly in the JVM. Cloning an Attribute using the following code is 
much slower than invoking clone() (please note: not because of the reflection, 
which is only there to show how it should be implemented):

{code}
AttributeImpl clone = this.getClass().newInstance();
this.copyTo(clone);
return clone;
{code}
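
For comparison, both variants can be put side by side in a self-contained 
sketch. TermState below is a hypothetical stand-in for an AttributeImpl 
subclass, not a Lucene class, and the sketch makes no timing claim of its own. 
It uses getDeclaredConstructor().newInstance() because Class.newInstance() is 
deprecated in newer Java versions.

```java
class CloneSketch {
    // Hypothetical stand-in for an AttributeImpl subclass.
    static class TermState implements Cloneable {
        char[] buffer = new char[0];
        int length;

        // Copies this object's state into an existing target instance.
        void copyTo(TermState target) {
            target.buffer = buffer.clone();
            target.length = length;
        }

        // Variant 1: Object.clone(), followed by a deep copy of the array.
        @Override
        public TermState clone() {
            try {
                TermState copy = (TermState) super.clone();
                copy.buffer = buffer.clone();
                return copy;
            } catch (CloneNotSupportedException e) {
                throw new AssertionError(e); // unreachable: class is Cloneable
            }
        }

        // Variant 2: create a fresh instance reflectively, then copyTo().
        TermState reflectiveCopy() {
            try {
                TermState copy = getClass().getDeclaredConstructor().newInstance();
                copyTo(copy);
                return copy;
            } catch (ReflectiveOperationException e) {
                throw new IllegalStateException(e);
            }
        }
    }

    public static void main(String[] args) {
        TermState state = new TermState();
        state.buffer = new char[] {'a', 'b'};
        state.length = 2;
        TermState viaClone = state.clone();
        TermState viaCopy = state.reflectiveCopy();
        // Both copies are independent of the original buffer.
        System.out.println(viaClone.buffer != state.buffer); // prints true
        System.out.println(viaCopy.buffer != state.buffer);  // prints true
    }
}
```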

Michi Busch and I are currently investigating fast Attribute proxies (needed 
for flex MultiEnums) and also fast capturing of states using CGLIB.

> support protected words in Stemming TokenFilters
> ------------------------------------------------
>
>                 Key: LUCENE-2198
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2198
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>    Affects Versions: 3.0
>            Reporter: Robert Muir
>            Priority: Minor
>         Attachments: LUCENE-2198.patch, LUCENE-2198.patch
>
>
> This is from LUCENE-1515
> I propose that all stemming TokenFilters have an 'exclusion set' that 
> bypasses any stemming for words in this set.
> Some stemming tokenfilters have this, some do not.
> This would be one way for Karl to implement his new Swedish stemmer (as a 
> text file of ignore words).
> Additionally, it would remove duplication between Lucene and Solr, as they 
> reimplement SnowballFilter since it does not have this functionality.
> Finally, I think this is a pretty common use case, where people want to 
> ignore things like proper nouns in the stemming.
> As an alternative design I considered a case where we generalized this to 
> CharArrayMap (and ignoring words would mean mapping them to themselves), 
> which would also provide a mechanism to override the stemming algorithm. But 
> I think this is too expert, could be its own filter, and the only example of 
> this I can find is in the Dutch stemmer.
> So I think we should just provide ignore with CharArraySet, but if you feel 
> otherwise please comment.

