Hi, I am glad that I was able to help you!

One more optimization for your consumer: CharTermAttribute implements CharSequence, so you can append it to the StringBuilder directly; there is no need to call toString() (see http://goo.gl/Ffg9tW):

    builder.append(termAttribute);

This saves the instantiation of a useless extra String object per token.
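In the incrementToken() loop from the consumer further down this thread, that is a one-line change (a sketch; builder and termAttribute as declared in the stem() method below):

    while (tokenStream.incrementToken()) {
        if (builder.length() > 0) {
            builder.append(' ');
        }
        // CharTermAttribute is a CharSequence, so this appends the term's
        // characters directly, without allocating an intermediate String:
        builder.append(termAttribute);  // was: builder.append(termAttribute.toString())
    }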
Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de


-----Original Message-----
From: Joe Wong [mailto:jw...@adacado.com]
Sent: Thursday, March 20, 2014 11:50 PM
To: java-user@lucene.apache.org
Subject: Re: Possible issue with Tokenizer in lucene-analyzers-common-4.6.1

Thanks Uwe. It worked.


On Thu, Mar 20, 2014 at 3:28 PM, Uwe Schindler <u...@thetaphi.de> wrote:

Hi,

the IllegalStateException tells you what's wrong: "TokenStream contract violation: close() call missing"

Analyzer internally reuses TokenStreams, so when you call Analyzer.tokenStream() a second time it returns the same instance of your TokenStream. On that second call the state machine detects that the stream was never closed, which is easy to see: your consumer never closes the TokenStream returned by the analyzer after it finishes the incrementToken() loop. This is why I said that you have to follow the official consuming workflow as described on the TokenStream API page. Use try...finally or the Java 7 try-with-resources statement to make sure the TokenStream is closed after use. This also closes the reader (which is not strictly needed for StringReaders, but TokenStream has to clean up other internal resources in close() - the state machine ensures this is done).

One additional tip: in Lucene 4.6+ it is no longer necessary to wrap the input in a StringReader to analyze Strings. There is a second method on Analyzer that takes the String to analyze directly (instead of a Reader) and uses an optimized workflow internally.

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de
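Putting that advice together (try-with-resources, the String-taking tokenStream() overload, and the direct CharSequence append from the top of the thread), a corrected stem() could look like the following sketch. It assumes the Lucene 4.6 API and the Stemmer interface from the code below; it has not been tested against the original project:

    @Override
    public String stem(String text) {
        StringBuilder builder = new StringBuilder();
        // try-with-resources guarantees close(), which resets the reused
        // Tokenizer's state machine for the next call. The String overload
        // of tokenStream() makes the StringReader wrapper unnecessary.
        try (TokenStream tokenStream = analyzer.tokenStream(null, text)) {
            CharTermAttribute termAttribute =
                tokenStream.addAttribute(CharTermAttribute.class);
            tokenStream.reset();
            while (tokenStream.incrementToken()) {
                if (builder.length() > 0) {
                    builder.append(' ');
                }
                builder.append(termAttribute); // CharSequence, no toString()
            }
            tokenStream.end();
        } catch (IOException e) {
            // shouldn't happen when analyzing an in-memory String
            throw new RuntimeException(e.getMessage(), e);
        }
        return builder.toString();
    }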
-----Original Message-----
From: Joe Wong [mailto:jw...@adacado.com]
Sent: Thursday, March 20, 2014 11:13 PM
To: java-user@lucene.apache.org
Subject: Re: Possible issue with Tokenizer in lucene-analyzers-common-4.6.1

Hi Uwe,

Thanks for the reply. I'm not familiar with Lucene usage, so any help would be appreciated.

In our test we execute several consecutive stemming operations (the exception is thrown when the second stemmer.stem() call is made). The code, see below, does call the reset() method, but as you say it may be called in the wrong place.

    @Test
    public void stem() {
        LuceneStemmer stemmer = new LuceneStemmer();
        assertEquals("thing", stemmer.stem("thing"));
        assertEquals("thing", stemmer.stem("things"));
        assertEquals("genius", stemmer.stem("geniuses"));
        assertEquals("fri", stemmer.stem("fries"));
        assertEquals("gentli", stemmer.stem("gently"));
    }

--- LuceneStemmer class ---

    import java.io.IOException;
    import java.io.StringReader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.en.PorterStemFilter;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public class LuceneStemmer implements Stemmer {

        /** Analyzer that tokenizes on whitespace, and lower-cases and stems words */
        private Analyzer analyzer = new StemmingAnalyzer();

        /**
         * Returns a version of the text with all words lower-cased and stemmed
         * @param text String to stem
         * @return stemmed text
         */
        @Override
        public String stem(String text) {
            StringBuilder builder = new StringBuilder();
            try {
                TokenStream tokenStream = analyzer.tokenStream(null, new StringReader(text));
                tokenStream.reset();

                CharTermAttribute termAttribute = tokenStream.getAttribute(CharTermAttribute.class);
                while (tokenStream.incrementToken()) {
                    if (builder.length() > 0) {
                        builder.append(' ');
                    }
                    builder.append(termAttribute.toString());
                }
            } catch (IOException e) {
                // shouldn't happen reading from a StringReader, but you never know
                throw new RuntimeException(e.getMessage(), e);
            }
            return builder.toString();
        }

    }

--- StemmingAnalyzer class ---

    import com.google.common.collect.Sets;
    import java.io.Reader;
    import java.util.Collections;
    import java.util.Set;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.core.LowerCaseFilter;
    import org.apache.lucene.analysis.core.StopFilter;
    import org.apache.lucene.analysis.core.WhitespaceTokenizer;
    import org.apache.lucene.analysis.en.PorterStemFilter;
    import org.apache.lucene.analysis.util.CharArraySet;

    import static com.adacado.Constants.*;

    public final class StemmingAnalyzer extends Analyzer {

        private Set<String> stopWords;

        public StemmingAnalyzer() {
            this.stopWords = Collections.EMPTY_SET;
        }

        public StemmingAnalyzer(Set<String> stopWords) {
            this.stopWords = stopWords;
        }

        public StemmingAnalyzer(String... stopWords) {
            this.stopWords = Sets.newHashSet(stopWords);
        }

        @Override
        protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
            Tokenizer source = new WhitespaceTokenizer(LUCENE_VERSION, reader);
            TokenStream filter = new StopFilter(LUCENE_VERSION,
                    new PorterStemFilter(new LowerCaseFilter(LUCENE_VERSION, source)),
                    CharArraySet.copy(LUCENE_VERSION, stopWords));
            return new TokenStreamComponents(source, filter);
        }

    }

Stack trace:

    java.lang.IllegalStateException: TokenStream contract violation: close() call missing
        at org.apache.lucene.analysis.Tokenizer.setReader(Tokenizer.java:89)
        at org.apache.lucene.analysis.Analyzer$TokenStreamComponents.setReader(Analyzer.java:307)
        at org.apache.lucene.analysis.Analyzer.tokenStream(Analyzer.java:145)
        at LuceneStemmer.stem(LuceneStemmer.java:28)
        at LuceneStemmerTest.stem(LuceneStemmerTest.java:16)

Thanks.

Regards,
Joe
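That stack trace reduces to the following call sequence (a sketch of the failure mode, not the test itself; IOException handling omitted):

    Analyzer analyzer = new StemmingAnalyzer();

    // First stem(): the Analyzer creates a TokenStream and caches it internally.
    TokenStream first = analyzer.tokenStream(null, new StringReader("things"));
    first.reset();
    while (first.incrementToken()) { /* consume */ }
    // first.close() is never called, so the cached Tokenizer's state
    // machine still considers it "in use".

    // Second stem(): the Analyzer hands back the SAME cached instance and
    // calls setReader() on it, which throws
    //   java.lang.IllegalStateException:
    //   TokenStream contract violation: close() call missing
    TokenStream second = analyzer.tokenStream(null, new StringReader("geniuses"));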
On Thu, Mar 20, 2014 at 1:40 PM, Uwe Schindler <u...@thetaphi.de> wrote:

Hi Joe,

in Lucene 4.6, the TokenStream/Tokenizer APIs got some additional state machine checks to ensure that consumers and subclasses of those abstract interfaces are implemented correctly. The checks are not easy to understand, because they are implemented in a way that does not affect performance. If your test case consumes the Tokenizer/TokenStream in a wrong way (e.g. missing calls to reset() or setReader() at the correct places), an IllegalStateException is thrown. The ILLEGAL_STATE_READER is there to ensure that the consumer gets a correct exception if it calls setReader() or reset() in the wrong order (or multiple times).

The checks in the base class are definitely OK. If you hit the IllegalStateException, you have a problem in your implementation of the Tokenizer/TokenStream interface (e.g. missing super() calls, or calling reset() from inside setReader(), or similar), or the consumer does not respect the fully documented workflow:
http://lucene.apache.org/core/4_6_1/core/org/apache/lucene/analysis/TokenStream.html

If you have TokenFilters in your analysis chain, the source of the error may also be missing super delegations in reset(), end(), ... If you need further help, post your implementation of the consumer in your test case, or post your analysis chain and custom Tokenizers. You may also post the stack trace in addition, because it helps to find out what call sequence you have.

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de
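The super-delegation point looks like this in practice (a sketch with a hypothetical filter; CountingFilter is not code from this thread):

    import java.io.IOException;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;

    // A TokenFilter that counts tokens. The overridden reset() MUST call
    // super.reset(), because TokenFilter delegates reset()/end()/close()
    // to the wrapped input stream; dropping the super call breaks the
    // chain's state machine and leads to exactly these IllegalStateExceptions.
    public final class CountingFilter extends TokenFilter {

        private int tokenCount;

        public CountingFilter(TokenStream input) {
            super(input);
        }

        @Override
        public boolean incrementToken() throws IOException {
            if (input.incrementToken()) {
                tokenCount++;
                return true;
            }
            return false;
        }

        @Override
        public void reset() throws IOException {
            super.reset();   // required: delegates to input.reset()
            tokenCount = 0;  // then reset this filter's own state
        }
    }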
-----Original Message-----
From: Joe Wong [mailto:jw...@adacado.com]
Sent: Thursday, March 20, 2014 8:58 PM
To: java-user@lucene.apache.org
Subject: Possible issue with Tokenizer in lucene-analyzers-common-4.6.1

Hi,

We're planning to upgrade lucene-analyzers-common from 4.3.0 to 4.6.1. While running our unit tests with 4.6.1, a test fails at org.apache.lucene.analysis.Tokenizer on line 88 (the setReader method). There it throws IllegalStateException if input != ILLEGAL_STATE_READER. Should the check not be input == ILLEGAL_STATE_READER?

Regards,
Joe
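(As the replies above explain, the != is intentional: close() is the call that resets the Tokenizer's input to ILLEGAL_STATE_READER, so finding any other reader in setReader() proves close() was skipped. A rough paraphrase of the 4.6.1 state machine, not the verbatim source:)

    // Sketch of the relevant Tokenizer state machine (paraphrased):
    public void close() throws IOException {
        input.close();
        // mark this Tokenizer as "between uses"
        inputPending = input = ILLEGAL_STATE_READER;
    }

    public final void setReader(Reader input) throws IOException {
        if (this.input != ILLEGAL_STATE_READER) {
            // still holding the previous reader => close() was never called
            throw new IllegalStateException(
                "TokenStream contract violation: close() call missing");
        }
        this.inputPending = input;
    }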