Thanks Uwe. It worked.
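
For the record, here is roughly what the fixed stem() method looks like now. This is just a sketch of what I ended up with: it follows the documented consuming workflow (reset() -> incrementToken() loop -> end() -> close()), uses the Java 7 try-with-resources statement so close() is always called, and switches to the String-taking Analyzer.tokenStream() you mentioned instead of wrapping the text in a StringReader.

    // Sketch of the corrected consumer: reset(), incrementToken() loop, end(),
    // with try-with-resources guaranteeing close() even if an exception is thrown.
    @Override
    public String stem(String text) {
        StringBuilder builder = new StringBuilder();
        // Lucene 4.6+: tokenStream(String, String) analyzes a String directly,
        // no StringReader needed.
        try (TokenStream tokenStream = analyzer.tokenStream(null, text)) {
            CharTermAttribute termAttribute =
                    tokenStream.getAttribute(CharTermAttribute.class);
            tokenStream.reset();
            while (tokenStream.incrementToken()) {
                if (builder.length() > 0) {
                    builder.append(' ');
                }
                builder.append(termAttribute.toString());
            }
            tokenStream.end();
        } catch (IOException e) {
            // shouldn't happen when analyzing an in-memory String, but just in case
            throw new RuntimeException(e.getMessage(), e);
        }
        return builder.toString();
    }
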
On Thu, Mar 20, 2014 at 3:28 PM, Uwe Schindler <u...@thetaphi.de> wrote:
> Hi,
>
> the IllegalStateException tells you what's wrong: "TokenStream contract
> violation: close() call missing"
>
> Analyzer internally reuses TokenStreams, so if you call
> Analyzer.tokenStream() a second time it will return the same instance of
> your TokenStream. On that second call the state machine detects that it
> was not closed before, which is easy to see: your consumer never closes
> the TokenStream returned by the analyzer after it finishes the
> incrementToken() loop. This is why I said that you have to follow the
> official consuming workflow as described on the TokenStream API page. Be
> sure to use try...finally or the Java 7 try-with-resources statement to
> make sure the TokenStream is closed after use. This will also close the
> reader (which is not needed for StringReaders, but TokenStream needs the
> additional cleanup of other internal resources when calling close() - the
> state machine ensures this is done).
>
> One additional tip: in Lucene 4.6+ it is no longer needed to pass a
> StringReader to analyze Strings. There is a second method in Analyzer
> that takes a String to analyze (instead of a Reader). This one uses an
> optimized workflow internally.
>
> Uwe
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: u...@thetaphi.de
>
>
> > -----Original Message-----
> > From: Joe Wong [mailto:jw...@adacado.com]
> > Sent: Thursday, March 20, 2014 11:13 PM
> > To: java-user@lucene.apache.org
> > Subject: Re: Possible issue with Tokenizer in lucene-analyzers-common-4.6.1
> >
> > Hi Uwe,
> >
> > Thanks for the reply. I'm not familiar with the usage of Lucene, so any
> > help would be appreciated.
> >
> > In our test we are executing several consecutive stemming operations
> > (the exception is thrown on the second stemmer.stem() call). The code
> > (see below) does call the reset() method, but like you say it could be
> > called at the wrong place.
> >
> > @Test
> > public void stem() {
> >     LuceneStemmer stemmer = new LuceneStemmer();
> >     assertEquals("thing", stemmer.stem("thing"));
> >     assertEquals("thing", stemmer.stem("things"));
> >     assertEquals("genius", stemmer.stem("geniuses"));
> >     assertEquals("fri", stemmer.stem("fries"));
> >     assertEquals("gentli", stemmer.stem("gently"));
> > }
> >
> > --- LuceneStemmer class ---
> > import java.io.IOException;
> > import java.io.StringReader;
> > import org.apache.lucene.analysis.Analyzer;
> > import org.apache.lucene.analysis.TokenStream;
> > import org.apache.lucene.analysis.en.PorterStemFilter;
> > import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
> >
> >
> > public class LuceneStemmer implements Stemmer {
> >
> >     /** Analyzer that tokenizes on whitespace, and lower-cases and stems words */
> >     private Analyzer analyzer = new StemmingAnalyzer();
> >
> >     /**
> >      * Returns version of text with all words lower-cased and stemmed
> >      * @param text String to stem
> >      * @return stemmed text
> >      */
> >     @Override
> >     public String stem(String text) {
> >         StringBuilder builder = new StringBuilder();
> >         try {
> >             TokenStream tokenStream = analyzer.tokenStream(null, new StringReader(text));
> >             tokenStream.reset();
> >
> >             CharTermAttribute termAttribute = tokenStream.getAttribute(CharTermAttribute.class);
> >             while (tokenStream.incrementToken()) {
> >                 if (builder.length() > 0) {
> >                     builder.append(' ');
> >                 }
> >                 builder.append(termAttribute.toString());
> >             }
> >         } catch (IOException e) {
> >             // shouldn't happen reading from a StringReader, but you never know
> >             throw new RuntimeException(e.getMessage(), e);
> >         }
> >         return builder.toString();
> >     }
> >
> > }
> >
> > --- StemmingAnalyzer class ----
> >
> > import com.google.common.collect.Sets;
> > import java.io.Reader;
> > import java.util.Collections;
> > import java.util.Set;
> > import org.apache.lucene.analysis.Analyzer;
> > import org.apache.lucene.analysis.TokenStream;
> > import org.apache.lucene.analysis.Tokenizer;
> > import org.apache.lucene.analysis.core.LowerCaseFilter;
> > import org.apache.lucene.analysis.core.StopFilter;
> > import org.apache.lucene.analysis.core.WhitespaceTokenizer;
> > import org.apache.lucene.analysis.en.PorterStemFilter;
> > import org.apache.lucene.analysis.util.CharArraySet;
> >
> > import static com.adacado.Constants.*;
> >
> > public final class StemmingAnalyzer extends Analyzer {
> >
> >     private Set<String> stopWords;
> >
> >     public StemmingAnalyzer() {
> >         this.stopWords = Collections.EMPTY_SET;
> >     }
> >
> >     public StemmingAnalyzer(Set<String> stopWords) {
> >         this.stopWords = stopWords;
> >     }
> >
> >     public StemmingAnalyzer(String... stopWords) {
> >         this.stopWords = Sets.newHashSet(stopWords);
> >     }
> >
> >     @Override
> >     protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
> >         Tokenizer source = new WhitespaceTokenizer(LUCENE_VERSION, reader);
> >         TokenStream filter = new StopFilter(LUCENE_VERSION,
> >                 new PorterStemFilter(new LowerCaseFilter(LUCENE_VERSION, source)),
> >                 CharArraySet.copy(LUCENE_VERSION, stopWords));
> >         return new TokenStreamComponents(source, filter);
> >     }
> >
> > }
> >
> > Stack trace:
> > java.lang.IllegalStateException: TokenStream contract violation: close() call missing
> >     at org.apache.lucene.analysis.Tokenizer.setReader(Tokenizer.java:89)
> >     at org.apache.lucene.analysis.Analyzer$TokenStreamComponents.setReader(Analyzer.java:307)
> >     at org.apache.lucene.analysis.Analyzer.tokenStream(Analyzer.java:145)
> >     at LuceneStemmer.stem(LuceneStemmer.java:28)
> >     at LuceneStemmerTest.stem(LuceneStemmerTest.java:16)
> >
> > Thanks.
> >
> > Regards,
> > Joe
> >
> >
> > On Thu, Mar 20, 2014 at 1:40 PM, Uwe Schindler <u...@thetaphi.de> wrote:
> > > Hi Joe,
> > >
> > > in Lucene 4.6, the TokenStream/Tokenizer APIs got some additional
> > > state machine checks to ensure that consumers and subclasses of those
> > > abstract interfaces are implemented correctly. The checks are not easy
> > > to understand, because they are implemented in a way that does not
> > > affect performance. If your test case consumes the Tokenizer/TokenStream
> > > in a wrong way (e.g. missing a call to reset() or setReader() at the
> > > correct places), an IllegalStateException is thrown. The
> > > ILLEGAL_STATE_READER is there to ensure that the consumer gets a
> > > correct exception if it calls setReader() or reset() in the wrong
> > > order (or multiple times).
> > >
> > > The checks in the base class are definitely OK. If you hit the
> > > IllegalStateException, you have some problem in your implementation
> > > of the Tokenizer/TokenStream interface (e.g. missing super() calls or
> > > calling reset() from inside setReader() or whatever), or the consumer
> > > does not respect the full documented workflow:
> > > http://lucene.apache.org/core/4_6_1/core/org/apache/lucene/analysis/TokenStream.html
> > >
> > > If you have TokenFilters in your analysis chain, the source of error
> > > may also be missing super delegations in reset(), end(), ... If you
> > > need further help, post your implementation of the consumer in your
> > > test case, or post your analysis chain and custom Tokenizers. You may
> > > also post the stack trace in addition, because this helps to find out
> > > what call sequence you have.
> > >
> > > Uwe
> > >
> > > -----
> > > Uwe Schindler
> > > H.-H.-Meier-Allee 63, D-28213 Bremen
> > > http://www.thetaphi.de
> > > eMail: u...@thetaphi.de
> > >
> > > > -----Original Message-----
> > > > From: Joe Wong [mailto:jw...@adacado.com]
> > > > Sent: Thursday, March 20, 2014 8:58 PM
> > > > To: java-user@lucene.apache.org
> > > > Subject: Possible issue with Tokenizer in lucene-analyzers-common-4.6.1
> > > >
> > > > Hi
> > > >
> > > > We're planning to upgrade lucene-analyzers-common from 4.3.0 to 4.6.1.
> > > > While running our unit test with 4.6.1 it fails at
> > > > org.apache.lucene.analysis.Tokenizer on line 88 (the setReader method).
> > > > There it checks if input != ILLEGAL_STATE_READER and then throws
> > > > IllegalStateException. Should it not be if input == ILLEGAL_STATE_READER?
> > > >
> > > > Regards,
> > > > Joe