[ https://issues.apache.org/jira/browse/LUCENE-2279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
thushara wijeratna updated LUCENE-2279:
---------------------------------------

A YourKit profile (before/after):
http://thushw.blogspot.com/2010/02/interesting-performance-characteristic.html

> eliminate pathological performance on StopFilter when using a Set<String> instead of CharArraySet
> -------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-2279
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2279
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: thushara wijeratna
>
> Passing a Set<String> to a StopFilter instead of a CharArraySet results in a
> very slow filter.
> This is because Analyzer.tokenStream() is called for each document, which
> ends up invoking the StopFilter constructor (if a StopFilter is used). If a
> regular Set<String> is passed to the StopFilter, all the elements of the set
> are copied into a new CharArraySet, as we can see in its constructor:
>
>   public StopFilter(boolean enablePositionIncrements, TokenStream input, Set stopWords, boolean ignoreCase)
>   {
>     super(input);
>     if (stopWords instanceof CharArraySet) {
>       this.stopWords = (CharArraySet)stopWords;
>     } else {
>       this.stopWords = new CharArraySet(stopWords.size(), ignoreCase);
>       this.stopWords.addAll(stopWords);
>     }
>     this.enablePositionIncrements = enablePositionIncrements;
>     init();
>   }
>
> I feel we should make the StopFilter signature specific, i.e. accept a
> CharArraySet rather than a Set, and there should be a JavaDoc warning on the
> other StopFilter variants, since they all result in a full copy of the stop
> set on each invocation of Analyzer.tokenStream().
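
To make the cost described in the issue concrete, here is a minimal sketch that uses only the JDK. FastStopSet and SketchStopFilter are hypothetical stand-ins for CharArraySet and StopFilter (they are not Lucene classes), and copiesFor() is an assumed helper that imitates Analyzer.tokenStream() being called once per document. It only counts how many times the stop set is copied; it does not measure time.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class StopSetReuseSketch {

    /** Hypothetical stand-in for CharArraySet; counts how many instances are built. */
    static final class FastStopSet extends HashSet<String> {
        static int copies = 0;

        FastStopSet(Set<String> source) {
            super(source);   // copies every element of the source set
            copies++;
        }
    }

    /** Hypothetical stand-in for StopFilter, with the same instanceof check as the quoted constructor. */
    static final class SketchStopFilter {
        final FastStopSet stopWords;

        SketchStopFilter(Set<String> stopWords) {
            if (stopWords instanceof FastStopSet) {
                this.stopWords = (FastStopSet) stopWords;     // reused as-is, no copy
            } else {
                this.stopWords = new FastStopSet(stopWords);  // full copy on every construction
            }
        }
    }

    /** Imitates one filter construction per document and returns the number of set copies made. */
    static int copiesFor(Set<String> stopWords, int documents) {
        FastStopSet.copies = 0;
        for (int i = 0; i < documents; i++) {
            new SketchStopFilter(stopWords);
        }
        return FastStopSet.copies;
    }

    public static void main(String[] args) {
        Set<String> plain = new HashSet<>(Arrays.asList("a", "an", "the"));
        System.out.println("plain Set copies:    " + copiesFor(plain, 1000));

        // Build the specialised set once, outside the per-document path.
        FastStopSet prebuilt = new FastStopSet(plain);
        System.out.println("prebuilt set copies: " + copiesFor(prebuilt, 1000));
    }
}
```

In real code the equivalent would presumably be to build the CharArraySet once (for example with the CharArraySet(int, boolean) constructor followed by addAll(), as the quoted StopFilter constructor does) and pass that same instance on every tokenStream() call.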