Hi Uwe,

Thanks for the reply. I'm not very familiar with how Lucene is meant to be used, so any help would be appreciated.
In our test we execute several consecutive stemming operations (the exception is thrown the second time stemmer.stem() is called). The code below does call reset(), but as you say it may be calling it in the wrong place.

@Test
public void stem() {
    LuceneStemmer stemmer = new LuceneStemmer();
    assertEquals("thing", stemmer.stem("thing"));
    assertEquals("thing", stemmer.stem("things"));
    assertEquals("genius", stemmer.stem("geniuses"));
    assertEquals("fri", stemmer.stem("fries"));
    assertEquals("gentli", stemmer.stem("gently"));
}

--- LuceneStemmer class ---

import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.en.PorterStemFilter;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class LuceneStemmer implements Stemmer {

    /** Analyzer that tokenizes on whitespace, and lower-cases and stems words */
    private Analyzer analyzer = new StemmingAnalyzer();

    /**
     * Returns version of text with all words lower-cased and stemmed
     * @param text String to stem
     * @return stemmed text
     */
    @Override
    public String stem(String text) {
        StringBuilder builder = new StringBuilder();
        try {
            TokenStream tokenStream = analyzer.tokenStream(null, new StringReader(text));
            tokenStream.reset();
            CharTermAttribute termAttribute = tokenStream.getAttribute(CharTermAttribute.class);
            while (tokenStream.incrementToken()) {
                if (builder.length() > 0) {
                    builder.append(' ');
                }
                builder.append(termAttribute.toString());
            }
        } catch (IOException e) {
            // shouldn't happen reading from a StringReader, but you never know
            throw new RuntimeException(e.getMessage(), e);
        }
        return builder.toString();
    }
}

--- StemmingAnalyzer class ---

import com.google.common.collect.Sets;

import java.io.Reader;
import java.util.Collections;
import java.util.Set;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.en.PorterStemFilter;
import org.apache.lucene.analysis.util.CharArraySet;

import static com.adacado.Constants.*;

public final class StemmingAnalyzer extends Analyzer {

    private Set<String> stopWords;

    public StemmingAnalyzer() {
        this.stopWords = Collections.EMPTY_SET;
    }

    public StemmingAnalyzer(Set<String> stopWords) {
        this.stopWords = stopWords;
    }

    public StemmingAnalyzer(String... stopWords) {
        this.stopWords = Sets.newHashSet(stopWords);
    }

    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        Tokenizer source = new WhitespaceTokenizer(LUCENE_VERSION, reader);
        TokenStream filter = new StopFilter(LUCENE_VERSION,
                new PorterStemFilter(new LowerCaseFilter(LUCENE_VERSION, source)),
                CharArraySet.copy(LUCENE_VERSION, stopWords));
        return new TokenStreamComponents(source, filter);
    }
}

--- Stack trace ---

java.lang.IllegalStateException: TokenStream contract violation: close() call missing
    at org.apache.lucene.analysis.Tokenizer.setReader(Tokenizer.java:89)
    at org.apache.lucene.analysis.Analyzer$TokenStreamComponents.setReader(Analyzer.java:307)
    at org.apache.lucene.analysis.Analyzer.tokenStream(Analyzer.java:145)
    at LuceneStemmer.stem(LuceneStemmer.java:28)
    at LuceneStemmerTest.stem(LuceneStemmerTest.java:16)
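Looking at the "close() call missing" message together with the workflow page you linked, my guess is that stem() also needs to call end() and close() after the incrementToken() loop, so that the reused Tokenizer will accept a new Reader on the next call. Here is roughly what I think the corrected method would look like (just my reading of the javadoc, untested so far; everything else is unchanged from the class above):

    // Guess at a corrected stem(), following the documented consume cycle:
    // reset() -> incrementToken() loop -> end() -> close().
    @Override
    public String stem(String text) {
        StringBuilder builder = new StringBuilder();
        try {
            TokenStream tokenStream = analyzer.tokenStream(null, new StringReader(text));
            CharTermAttribute termAttribute = tokenStream.getAttribute(CharTermAttribute.class);
            try {
                tokenStream.reset();               // must precede the first incrementToken()
                while (tokenStream.incrementToken()) {
                    if (builder.length() > 0) {
                        builder.append(' ');
                    }
                    builder.append(termAttribute.toString());
                }
                tokenStream.end();                 // signal that the stream is fully consumed
            } finally {
                tokenStream.close();               // lets the Analyzer reuse its Tokenizer on the next call
            }
        } catch (IOException e) {
            // shouldn't happen reading from a StringReader, but you never know
            throw new RuntimeException(e.getMessage(), e);
        }
        return builder.toString();
    }

Does that look like the right consume cycle, or is there more to it?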
Thanks.

Regards,
Joe


On Thu, Mar 20, 2014 at 1:40 PM, Uwe Schindler <u...@thetaphi.de> wrote:

> Hi Joe,
>
> in Lucene 4.6, the TokenStream/Tokenizer APIs got some additional state
> machine checks to ensure that consumers and subclasses of those abstract
> interfaces are implemented in a correct way - they are not easy to
> understand, because they are implemented in that way to ensure they don't
> affect performance. If your test case consumes the Tokenizer/TokenStream
> in the wrong way (e.g. missing to call reset() or setReader() at the
> correct places), an IllegalStateException is thrown. The
> ILLEGAL_STATE_READER is there to ensure that the consumer gets a correct
> exception if it calls setReader() or reset() in the wrong order (or
> multiple times).
>
> The checks in the base class are definitely OK; if you hit the
> IllegalStateException, you have some problems in your implementation of
> the Tokenizer/TokenStream interface (e.g. missing super() calls or calling
> reset() from inside setReader() or whatever). Or the consumer does not
> respect the full documented workflow:
> http://lucene.apache.org/core/4_6_1/core/org/apache/lucene/analysis/TokenStream.html
>
> If you have TokenFilters in your analysis chain, the source of error may
> also be missing super delegations in reset(), end(), ... If you need
> further help, post your implementation of the consumer in your test case,
> or post your analysis chain and custom Tokenizers. You may also post the
> stack trace in addition, because this helps to find out what call sequence
> you have.
>
> Uwe
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: u...@thetaphi.de
>
> > -----Original Message-----
> > From: Joe Wong [mailto:jw...@adacado.com]
> > Sent: Thursday, March 20, 2014 8:58 PM
> > To: java-user@lucene.apache.org
> > Subject: Possible issue with Tokenizer in lucene-analyzers-common-4.6.1
> >
> > Hi
> >
> > We're planning to upgrade lucene-analyzers-common 4.3.0 to 4.6.1. While
> > running our unit test with 4.6.1 it fails at
> > org.apache.lucene.analysis.Tokenizer on line 88 (the setReader method).
> > There it checks if input != ILLEGAL_STATE_READER and then throws
> > IllegalStateException. Should it not be if input == ILLEGAL_STATE_READER?
> >
> > Regards,
> > Joe
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org