[jira] [Commented] (LUCENE-3236) Make LowerCaseFilter and StopFilter keyword aware, similar to PorterStemFilter

Bernhard Kraft (JIRA) Tue, 14 Jan 2014 08:03:17 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-3236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13870839#comment-13870839
 ]


Bernhard Kraft commented on LUCENE-3236:
----------------------------------------

Maybe this should be handled directly in the "TokenFilter" class. If I 
understand the concept of "keywords" correctly, keyword tokens should not be 
modified by any token filter unless that filter explicitly wants to operate on 
tokens marked as keywords.

Instead of patching every existing token filter and expecting every token 
filter developer to know about "KeywordAttribute", I suggest modifying the 
"TokenFilter" class to handle tokens with "KeywordAttribute" internally.

As far as I understand the concept of token filters, the "Chain of 
responsibility" pattern is used here. Currently only the "KeywordAttribute" 
changes the "flow" of a token, but more attributes of this kind are likely to 
appear over time.

My suggestion is shown in the attached diagram. Instead of letting a token 
filter call the "incrementToken" method of the preceding filter directly, I 
suggest encapsulating each "real" token filter in a wrapper class that 
delegates calls to defined methods (incrementToken, reset, etc.) where 
appropriate. It would then be possible to create a "keywordAwareTokenFilter" 
interface. If a token filter implements this interface, the encapsulating 
class (named, for example, TokenFilterChainElement or TokenFilterContainer) 
also calls "incrementToken" when a token with the keyword attribute is 
encountered. Filters that do not implement the interface never see keyword 
tokens.
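
A rough sketch of the wrapper idea, written against simplified stand-in types 
rather than Lucene's real TokenFilter/incrementToken API. Everything here is 
an assumption made for illustration: Token, TokenStep, LowerCaseStep, 
BracketStep and ChainDemo are hypothetical names, the marker interface is 
capitalized per Java convention, and the per-token apply() method replaces 
the attribute-based incrementToken() flow.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Simplified token: term text plus a keyword flag (stand-in for KeywordAttribute).
final class Token {
    final String term;
    final boolean keyword;

    Token(String term, boolean keyword) {
        this.term = term;
        this.keyword = keyword;
    }
}

// Hypothetical per-token step; a stand-in for a "real" token filter, not Lucene's TokenFilter.
interface TokenStep {
    Token apply(Token token);
}

// Hypothetical marker interface: implementers also want to see keyword tokens.
interface KeywordAwareTokenFilter {
}

// Wrapper that decides whether the wrapped filter gets to see a token at all.
final class TokenFilterChainElement implements TokenStep {
    private final TokenStep wrapped;

    TokenFilterChainElement(TokenStep wrapped) {
        this.wrapped = wrapped;
    }

    @Override
    public Token apply(Token token) {
        boolean keywordAware = wrapped instanceof KeywordAwareTokenFilter;
        if (token.keyword && !keywordAware) {
            return token; // keyword tokens bypass filters that did not opt in
        }
        return wrapped.apply(token);
    }
}

// Example filter that knows nothing about keywords.
final class LowerCaseStep implements TokenStep {
    @Override
    public Token apply(Token token) {
        return new Token(token.term.toLowerCase(Locale.ROOT), token.keyword);
    }
}

// Example filter that opts in to keyword tokens via the marker interface.
final class BracketStep implements TokenStep, KeywordAwareTokenFilter {
    @Override
    public Token apply(Token token) {
        return new Token("[" + token.term + "]", token.keyword);
    }
}

final class ChainDemo {
    // Runs every token through every step, each step wrapped in a chain element.
    static List<String> run(List<Token> tokens, List<TokenStep> steps) {
        List<String> out = new ArrayList<>();
        for (Token token : tokens) {
            Token current = token;
            for (TokenStep step : steps) {
                current = new TokenFilterChainElement(step).apply(current);
            }
            out.add(current.term);
        }
        return out;
    }
}
```

In this sketch a filter author who ignores keywords entirely (LowerCaseStep) 
gets the safe behaviour by default, and only filters that implement the marker 
interface are handed keyword tokens.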

> Make LowerCaseFilter and StopFilter keyword aware, similar to PorterStemFilter
> ------------------------------------------------------------------------------
>
>                 Key: LUCENE-3236
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3236
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/analysis
>    Affects Versions: 4.0-ALPHA
>         Environment: N/A
>            Reporter: Sujit Pal
>            Priority: Minor
>              Labels: analysis
>             Fix For: 4.7
>
>         Attachments: lucene-3236-patch.diff
>
>
> PorterStemFilter has functionality to detect if a term has been marked as a 
> "keyword" by the KeywordMarkerFilter (KeywordAttribute.isKeyword() == true), 
> and if so, skip stemming.
> The suggestion is to have the same functionality in other filters where it is 
> applicable. I think it may be particularly applicable to the LowerCaseFilter 
> (ie if it is a keyword, don't mess with the case), and StopFilter (if it is a 
> keyword, then don't filter it out even if it looks like a stop word).
> Backward compatibility is maintained (in both cases) by adding a new 
> constructor which takes an additional boolean parameter ignoreKeyword. The 
> current constructor will call this new constructor with ignoreKeyword = false.
> Patches are attached (for LowerCaseFilter and StopFilter).
> I have verified that the analysis JUnit tests run against the updated code, 
> ie, backward compatibility is maintained.
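
For comparison, the constructor-based approach described in the quoted issue 
might look roughly like the sketch below. It is not the attached patch: Term, 
SketchLowerCaseFilter and SketchStopFilter are hypothetical stand-ins, and the 
reading that ignoreKeyword = true means "leave keyword tokens alone" is an 
assumption based on the description.

```java
import java.util.Locale;
import java.util.Set;

// Simplified term: text plus the flag that KeywordAttribute.isKeyword() would report.
final class Term {
    final String text;
    final boolean keyword;

    Term(String text, boolean keyword) {
        this.text = text;
        this.keyword = keyword;
    }
}

// Hypothetical stand-in for a lowercasing filter with the extra boolean parameter.
final class SketchLowerCaseFilter {
    private final boolean ignoreKeyword;

    // Existing constructor keeps the old behaviour by delegating with false.
    SketchLowerCaseFilter() {
        this(false);
    }

    SketchLowerCaseFilter(boolean ignoreKeyword) {
        this.ignoreKeyword = ignoreKeyword;
    }

    Term filter(Term term) {
        if (ignoreKeyword && term.keyword) {
            return term; // keyword: do not touch the case
        }
        return new Term(term.text.toLowerCase(Locale.ROOT), term.keyword);
    }
}

// Hypothetical stand-in for a stop-word filter with the extra boolean parameter.
final class SketchStopFilter {
    private final Set<String> stopWords;
    private final boolean ignoreKeyword;

    // Existing constructor keeps the old behaviour by delegating with false.
    SketchStopFilter(Set<String> stopWords) {
        this(stopWords, false);
    }

    SketchStopFilter(Set<String> stopWords, boolean ignoreKeyword) {
        this.stopWords = stopWords;
        this.ignoreKeyword = ignoreKeyword;
    }

    // Returns true if the term should stay in the stream.
    boolean accept(Term term) {
        if (ignoreKeyword && term.keyword) {
            return true; // keyword: keep it even if it looks like a stop word
        }
        return !stopWords.contains(term.text);
    }
}
```

With the flag left at its default, both sketches behave as if keywords did not 
exist, which is the backward-compatibility property the description mentions.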



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)

