[ https://issues.apache.org/jira/browse/LUCENE-2279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837792#action_12837792 ]
Michael McCandless commented on LUCENE-2279:
--------------------------------------------

bq. I would stop right here and ask to discuss it on the dev list, thoughts mike?!

Agreed... I'll start a thread.

{quote}
bq. Maybe we should absorb ReusableAnalyzerBase back into Analyzer?

That would be the logical consequence but the problem with ReusableAnalyzerBase is that it will break bw compat if moved to Analyzer.
{quote}

Right, this is why I was thinking if we make a new analyzers package, it's a chance to break/improve things. We'd have a single abstract base class that only exposes the reuse API.

bq. in my opinion all the core analyzers (you already fixed contrib) should be final.

I agree, and we should consistently take this approach w/ the new analyzers package...

bq. i still don't quite understand how it gives us more freedom to break/change the APIs, i mean however we label this stuff, a break is a break to the user at the end of the day.

Because it'd be an entirely new package, so we can create a new base Analyzer class (in that package) that breaks/fixes things when compared to Lucene's Analyzer class. We'd eventually deprecate the analyzers/tokenizers/token filters in Lucene/Solr/Nutch in favor of this new package, and users can switch over on their own schedule.

> eliminate pathological performance on StopFilter when using a Set<String> instead of CharArraySet
> -------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-2279
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2279
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: thushara wijeratna
>            Priority: Minor
>
> passing a Set<String> to a StopFilter instead of a CharArraySet results in a very slow filter.
> this is because for each document, Analyzer.tokenStream() is called, which ends up calling the StopFilter (if used).
> And if a regular Set<String> is used in the StopFilter, all the elements of the set are copied to a CharArraySet, as we can see in its ctor:
>
> public StopFilter(boolean enablePositionIncrements, TokenStream input, Set stopWords, boolean ignoreCase)
> {
>   super(input);
>   if (stopWords instanceof CharArraySet) {
>     this.stopWords = (CharArraySet)stopWords;
>   } else {
>     this.stopWords = new CharArraySet(stopWords.size(), ignoreCase);
>     this.stopWords.addAll(stopWords);
>   }
>   this.enablePositionIncrements = enablePositionIncrements;
>   init();
> }
>
> i feel we should make the StopFilter signature specific, as in specifying CharArraySet vs Set, and there should be a JavaDoc warning on using the other variants of the StopFilter as they all result in a copy for each invocation of Analyzer.tokenStream().
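
To make the cost described in the issue concrete, here is a minimal, self-contained sketch of the copy-per-construction behaviour. It does not use Lucene: FastStopSet and SketchStopFilter are hypothetical stand-ins for CharArraySet and StopFilter, and the copy counter exists only for illustration. The one thing taken from the issue is the instanceof check in the quoted ctor.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/**
 * Sketch only. FastStopSet and SketchStopFilter are hypothetical stand-ins
 * for Lucene's CharArraySet and StopFilter; they are NOT the Lucene classes.
 */
final class StopSetReuseSketch {

    /** Stand-in for CharArraySet: the specialized set type the filter wants. */
    static final class FastStopSet extends HashSet<String> {
        FastStopSet(Set<String> words) {
            super(words);
        }
    }

    /** Stand-in for StopFilter: mirrors the instanceof check from the quoted ctor. */
    static final class SketchStopFilter {
        static int copies = 0; // illustration only: counts full copies of the stop set
        final FastStopSet stopWords;

        SketchStopFilter(Set<String> stopWords) {
            if (stopWords instanceof FastStopSet) {
                this.stopWords = (FastStopSet) stopWords; // reused as-is, no copy
            } else {
                this.stopWords = new FastStopSet(stopWords); // every element copied
                copies++;
            }
        }

        boolean isStopWord(String term) {
            return stopWords.contains(term);
        }
    }

    /** Builds one filter per "document" and returns how many set copies occurred. */
    static int copiesFor(Set<String> stopWords, int documents) {
        SketchStopFilter.copies = 0;
        for (int i = 0; i < documents; i++) {
            new SketchStopFilter(stopWords);
        }
        return SketchStopFilter.copies;
    }

    public static void main(String[] args) {
        Set<String> plain = new HashSet<>(Arrays.asList("a", "an", "the"));
        Set<String> prebuilt = new FastStopSet(plain); // converted once, up front

        System.out.println("plain set copies:    " + copiesFor(plain, 1000));
        System.out.println("prebuilt set copies: " + copiesFor(prebuilt, 1000));
    }
}
```

In this sketch the plain set is copied once per filter (1000 copies for 1000 documents), while the prebuilt set is never copied. Under the same assumption, the workaround in real code would be to build the CharArraySet once and hand that same instance to every StopFilter.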