[ https://issues.apache.org/jira/browse/LUCENE-2279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837792#action_12837792 ]

Michael McCandless commented on LUCENE-2279:
--------------------------------------------

bq. I would stop right here and ask to discuss it on the dev list, thoughts mike?!

Agreed... I'll start a thread.

{quote}
bq. Maybe we should absorb ReusableAnalyzerBase back into Analyzer?

That would be the logical consequence, but the problem with ReusableAnalyzerBase 
is that it will break backwards compatibility if moved to Analyzer.
{quote}

Right, this is why I was thinking that if we make a new analyzers package, it's a 
chance to break/improve things.  We'd have a single abstract base class that 
only exposes a reuse API.
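
To make that concrete, here's a rough sketch of what such a reuse-only base class could look like, loosely modeled on ReusableAnalyzerBase.  All names here (ReuseAnalyzer, TokenStreamComponents, createComponents) are just illustrative, not a committed API:

{code:java}
import java.io.IOException;
import java.io.Reader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;

// Rough sketch of an abstract base class whose only public tokenization
// entry point is the reuse API.  Names are illustrative, not the final API.
public abstract class ReuseAnalyzer {

  // One component chain per thread, reused across fields/documents.
  private final ThreadLocal<TokenStreamComponents> reused =
      new ThreadLocal<TokenStreamComponents>();

  // Subclasses build their Tokenizer + TokenFilter chain exactly once here.
  protected abstract TokenStreamComponents createComponents(String fieldName, Reader reader);

  // The only public entry point: always reuses the cached chain if one exists.
  public final TokenStream reusableTokenStream(String fieldName, Reader reader)
      throws IOException {
    TokenStreamComponents components = reused.get();
    if (components == null) {
      components = createComponents(fieldName, reader);
      reused.set(components);
    } else {
      components.reset(reader); // re-target the existing chain at the new Reader
    }
    return components.getTokenStream();
  }

  // Minimal holder for the source Tokenizer and the end of the filter chain.
  public static class TokenStreamComponents {
    private final Tokenizer source;
    private final TokenStream sink;

    public TokenStreamComponents(Tokenizer source, TokenStream sink) {
      this.source = source;
      this.sink = sink;
    }

    void reset(Reader reader) throws IOException {
      source.reset(reader); // point the existing tokenizer at the new input
    }

    TokenStream getTokenStream() {
      return sink;
    }
  }
}
{code}

A concrete analyzer would then only override createComponents and get reuse for free.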

bq. in my opinion all the core analyzers (you already fixed contrib) should be 
final. 

I agree, and we should consistently take this approach w/ the new analyzers 
package...

bq. i still don't quite understand how it gives us more freedom to break/change 
the APIs, i mean however we label this stuff, a break is a break to the user at 
the end of the day.

Because it'd be an entirely new package, so we can create a new base Analyzer 
class (in that package) that breaks/fixes things when compared to Lucene's 
Analyzer class.

We'd eventually deprecate the analyzers/tokenizers/token filters in 
Lucene/Solr/Nutch in favor of this new package, and users can switch over on 
their own schedule.


> eliminate pathological performance on StopFilter when using a Set<String> 
> instead of CharArraySet
> -------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-2279
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2279
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: thushara wijeratna
>            Priority: Minor
>
> Passing a Set<String> to a StopFilter instead of a CharArraySet results in a 
> very slow filter.
> This is because for each document, Analyzer.tokenStream() is called, which 
> ends up calling the StopFilter (if used). And if a regular Set<String> is 
> used in the StopFilter, all the elements of the set are copied to a 
> CharArraySet, as we can see in its ctor:
> public StopFilter(boolean enablePositionIncrements, TokenStream input, Set 
> stopWords, boolean ignoreCase)
>   {
>     super(input);
>     if (stopWords instanceof CharArraySet) {
>       this.stopWords = (CharArraySet)stopWords;
>     } else {
>       this.stopWords = new CharArraySet(stopWords.size(), ignoreCase);
>       this.stopWords.addAll(stopWords);
>     }
>     this.enablePositionIncrements = enablePositionIncrements;
>     init();
>   }
> I feel we should make the StopFilter signature specific, i.e. require a 
> CharArraySet rather than a Set, and there should be a JavaDoc warning on the 
> other StopFilter variants, since they all result in a copy for each 
> invocation of Analyzer.tokenStream().
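
For reference, a minimal sketch of how callers can avoid the per-call copy today: build the CharArraySet once and hand that to the StopFilter, so the instanceof CharArraySet branch in the ctor quoted above is taken on every Analyzer.tokenStream() call.  The class name and stop word list below are made up for illustration; it assumes the StopFilter ctor shown above plus CharArraySet's collection constructor:

{code:java}
import java.io.Reader;
import java.util.Arrays;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;

// Illustrative analyzer: the stop set is built as a CharArraySet once, up
// front, so StopFilter's ctor never copies a Set<String> per tokenStream() call.
public final class NoCopyStopAnalyzer extends Analyzer {

  // Built once and shared by every StopFilter this analyzer creates.
  private static final CharArraySet STOP_WORDS =
      new CharArraySet(Arrays.asList("a", "an", "the"), /* ignoreCase */ true);

  @Override
  public TokenStream tokenStream(String fieldName, Reader reader) {
    TokenStream result = new WhitespaceTokenizer(reader);
    // Hits the "stopWords instanceof CharArraySet" fast path in the ctor above.
    return new StopFilter(true, result, STOP_WORDS, true);
  }
}
{code}

Building the set via StopFilter.makeStopSet() should, I believe, amount to the same thing, since it also produces a CharArraySet up front.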


