[ https://issues.apache.org/jira/browse/LUCENE-2279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837759#action_12837759 ]

Robert Muir commented on LUCENE-2279:
-------------------------------------

bq. Yet, as an analyzer developer you really want to use the new 
ReusableAnalyzerBase in favor of Analyzer in 99% of the cases; it will 
require you to write half of the code, plus it gives you reusability of the 
tokenStream

and the 1% of extremely advanced cases that can't reuse can just use TokenStreams 
directly when indexing, i.e. the Analyzer class could be reusable by 
definition. we shouldn't let these obscure cases slow down everyone else.
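
for illustration, a rough sketch of that direct-TokenStream path against the 
2.9/3.0 Field API (the class name, field name, and tokenizer choice below are 
just placeholders, nothing prescribed by this issue):

    import java.io.StringReader;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.WhitespaceTokenizer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    class DirectTokenStreamExample {
      Document makeDoc(String text) {
        // build the per-field TokenStream yourself; no Analyzer is consulted
        TokenStream stream = new WhitespaceTokenizer(new StringReader(text));
        Document doc = new Document();
        // Field(String, TokenStream) lets IndexWriter consume the pre-built stream
        doc.add(new Field("body", stream));
        return doc;
      }
    }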

bq. It assumes both #reusableTokenStream and #tokenStream to be final

in my opinion all the core analyzers (you already fixed contrib) should be 
final. this is another trap: if you subclass one of these analyzers and 
implement 'tokenStream', it's immediately slow due to the backwards-compatibility 
code.
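
for reference, a sketch of what an analyzer on top of the proposed 
ReusableAnalyzerBase might look like, assuming the createComponents / 
TokenStreamComponents shape from that patch (the analyzer name and stop words 
are made up here):

    import java.io.Reader;
    import org.apache.lucene.analysis.ReusableAnalyzerBase;
    import org.apache.lucene.analysis.StopFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.WhitespaceTokenizer;

    // subclasses only implement createComponents(); tokenStream and
    // reusableTokenStream stay final in the base class, so the chain can be
    // cached and reused instead of being rebuilt per document.
    public final class MyStopAnalyzer extends ReusableAnalyzerBase {
      @Override
      protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        Tokenizer source = new WhitespaceTokenizer(reader);
        TokenStream result = new StopFilter(true, source,
            StopFilter.makeStopSet(new String[] {"a", "the"}), false);
        return new TokenStreamComponents(source, result);
      }
    }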

bq. I think Lucene/Solr/Nutch need to eventually get to this point

if this is what we should do to remove the code duplication, then i am all for 
it. i still don't quite understand how it gives us more freedom to break/change 
the APIs; i mean, however we label this stuff, a break is a break to the user at 
the end of the day.

> eliminate pathological performance on StopFilter when using a Set<String> 
> instead of CharArraySet
> -------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-2279
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2279
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: thushara wijeratna
>            Priority: Minor
>
> passing a Set<String> to a StopFilter instead of a CharArraySet results in a 
> very slow filter.
> this is because for each document, Analyzer.tokenStream() is called, which 
> ends up calling the StopFilter (if used). And if a regular Set<String> is 
> used in the StopFilter, all the elements of the set are copied to a 
> CharArraySet, as we can see in its constructor:
> public StopFilter(boolean enablePositionIncrements, TokenStream input, Set 
> stopWords, boolean ignoreCase)
>   {
>     super(input);
>     if (stopWords instanceof CharArraySet) {
>       this.stopWords = (CharArraySet)stopWords;
>     } else {
>       this.stopWords = new CharArraySet(stopWords.size(), ignoreCase);
>       this.stopWords.addAll(stopWords);
>     }
>     this.enablePositionIncrements = enablePositionIncrements;
>     init();
>   }
> i feel we should make the StopFilter signature specific, as in specifying 
> CharArraySet vs Set, and there should be a JavaDoc warning on using the other 
> variants of StopFilter, as they all result in a copy on each invocation 
> of Analyzer.tokenStream().
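
for what it's worth, the copy can already be avoided by building the stop set 
once (e.g. via StopFilter.makeStopSet, which hands back a CharArraySet) and 
reusing it across calls; a sketch, with a made-up analyzer name and word list:

    import java.io.Reader;
    import java.util.Set;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.StopFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.WhitespaceTokenizer;

    public class CachedStopSetAnalyzer extends Analyzer {
      // built once; since makeStopSet returns a CharArraySet, the instanceof
      // branch in the StopFilter constructor skips the element-by-element copy
      private static final Set STOP_WORDS =
          StopFilter.makeStopSet(new String[] {"a", "an", "the"});

      @Override
      public TokenStream tokenStream(String fieldName, Reader reader) {
        return new StopFilter(true, new WhitespaceTokenizer(reader), STOP_WORDS, false);
      }
    }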
