RE: Possible issue with Tokenizer in lucene-analyzers-common-4.6.1

Uwe Schindler Thu, 20 Mar 2014 15:29:25 -0700

Hi,

the IllegalStateException tells you what's wrong: "TokenStream contract 
violation: close() call missing"


Analyzer reuses TokenStreams internally, so a second call to 
Analyzer.tokenStream() returns the same TokenStream instance. On that second 
call the state machine detects that the stream was never closed, and the cause 
is easy to see: your consumer never closes the TokenStream returned by the 
analyzer after it finishes the incrementToken() loop. This is why I said that 
you have to follow the official consuming workflow described on the 
TokenStream API page: reset(), the incrementToken() loop, end(), and finally 
close(). Use try...finally or the Java 7 try-with-resources statement to make 
sure the TokenStream is closed after use. close() also closes the reader. That 
is not needed for StringReaders, but TokenStream has other internal resources 
to clean up in close(), and the state machine ensures this is done.
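
To make the reuse problem concrete without depending on Lucene, here is a 
minimal toy model. ToyTokenStream, ToyAnalyzer and ToyConsumers are 
hypothetical names invented for this sketch and are not Lucene classes. The 
model only imitates the "one reused instance plus a close() check" behaviour 
described above; it leaves out reset() and end() and makes no claim about how 
Lucene implements the check.

```java
/** Toy stand-in for a reusable token stream; NOT Lucene's real class. */
final class ToyTokenStream implements AutoCloseable {
    private String[] tokens = new String[0];
    private int pos;
    private boolean open; // true between setText() and close()

    /** Plays the role of setReader(): refuses new input while still open. */
    void setText(String text) {
        if (open) {
            throw new IllegalStateException(
                "TokenStream contract violation: close() call missing");
        }
        String trimmed = text.trim();
        tokens = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
        pos = 0;
        open = true;
    }

    /** Returns the next token, or null when the input is exhausted. */
    String next() {
        return pos < tokens.length ? tokens[pos++] : null;
    }

    @Override
    public void close() {
        open = false;
    }
}

/** Toy stand-in for Analyzer: hands out the same stream on every call. */
final class ToyAnalyzer {
    private final ToyTokenStream reused = new ToyTokenStream();

    ToyTokenStream tokenStream(String text) {
        reused.setText(text);
        return reused;
    }
}

final class ToyConsumers {
    /** Mirrors the failing consumer: reads all tokens but never closes. */
    static String leaky(ToyAnalyzer analyzer, String text) {
        ToyTokenStream ts = analyzer.tokenStream(text);
        return join(ts);
    }

    /** Closes the stream on every path via try-with-resources. */
    static String closing(ToyAnalyzer analyzer, String text) {
        try (ToyTokenStream ts = analyzer.tokenStream(text)) {
            return join(ts);
        }
    }

    private static String join(ToyTokenStream ts) {
        StringBuilder sb = new StringBuilder();
        for (String t = ts.next(); t != null; t = ts.next()) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(t);
        }
        return sb.toString();
    }
}
```

In this model the first call to ToyConsumers.leaky() succeeds and the second 
one on the same ToyAnalyzer throws the IllegalStateException, which matches 
the pattern in the test where only the second stem() call fails. 
ToyConsumers.closing() can be called any number of times.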

One additional tip: in Lucene 4.6+ you no longer need to wrap a String in a 
StringReader to analyze it. Analyzer has a second tokenStream() method that 
takes a String instead of a Reader, and it uses an optimized workflow 
internally.

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: [email protected]


> -----Original Message-----
> From: Joe Wong [mailto:[email protected]]
> Sent: Thursday, March 20, 2014 11:13 PM
> To: [email protected]
> Subject: Re: Possible issue with Tokenizer in lucene-analyzers-common-4.6.1
> 
> Hi Uwe,
> 
> Thanks for the reply. I'm not familiar with the usage of Lucene so any help
> would be appreciated.
> 
> In our test we execute several consecutive stemming operations (the
> exception is thrown when stemmer.stem() is called the second time). The
> code below does call the reset() method, but as you say it may be called
> in the wrong place.
> 
> @Test
>     public void stem() {
>         LuceneStemmer stemmer = new LuceneStemmer();
>         assertEquals("thing", stemmer.stem("thing"));
>         assertEquals("thing", stemmer.stem("things"));
>         assertEquals("genius", stemmer.stem("geniuses"));
>         assertEquals("fri", stemmer.stem("fries"));
>         assertEquals("gentli", stemmer.stem("gently"));
>     }
> 
> --- LuceneStemmer class ---
> import java.io.IOException;
> import java.io.StringReader;
> import org.apache.lucene.analysis.Analyzer;
> import org.apache.lucene.analysis.TokenStream;
> import org.apache.lucene.analysis.en.PorterStemFilter;
> import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
> 
> 
> public class LuceneStemmer implements Stemmer {
> 
>     /** Analyzer that tokenizes on whitespace, and lower-cases and stems words */
>     private Analyzer analyzer = new StemmingAnalyzer();
> 
>     /**
>      * Returns version of text with all words lower-cased and stemmed
>      * @param text String to stem
>      * @return stemmed text
>      */
>     @Override
>     public String stem(String text) {
>         StringBuilder builder = new StringBuilder();
>         try {
>             TokenStream tokenStream = analyzer.tokenStream(null, new StringReader(text));
>             tokenStream.reset();
> 
>             CharTermAttribute termAttribute = tokenStream.getAttribute(CharTermAttribute.class);
>             while (tokenStream.incrementToken()) {
>                 if (builder.length() > 0) {
>                     builder.append(' ');
>                 }
>                 builder.append(termAttribute.toString());
>             }
>         } catch (IOException e) {
>             // shouldn't happen reading from a StringReader, but you never know
>             throw new RuntimeException(e.getMessage(), e);
>         }
>         return builder.toString();
>     }
> 
> }
> 
> --- StemmingAnalyzer class ----
> 
> import com.google.common.collect.Sets;
> import java.io.Reader;
> import java.util.Collections;
> import java.util.Set;
> import org.apache.lucene.analysis.Analyzer;
> import org.apache.lucene.analysis.TokenStream;
> import org.apache.lucene.analysis.Tokenizer;
> import org.apache.lucene.analysis.core.LowerCaseFilter;
> import org.apache.lucene.analysis.core.StopFilter;
> import org.apache.lucene.analysis.core.WhitespaceTokenizer;
> import org.apache.lucene.analysis.en.PorterStemFilter;
> import org.apache.lucene.analysis.util.CharArraySet;
> 
> import static com.adacado.Constants.*;
> 
> public final class StemmingAnalyzer extends Analyzer {
> 
>     private Set<String> stopWords;
> 
>     public StemmingAnalyzer() {
>         this.stopWords = Collections.EMPTY_SET;
>     }
> 
>     public StemmingAnalyzer(Set<String> stopWords) {
>         this.stopWords = stopWords;
>     }
> 
>     public StemmingAnalyzer(String... stopWords) {
>         this.stopWords = Sets.newHashSet(stopWords);
>     }
> 
>     @Override
>     protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
>         Tokenizer source = new WhitespaceTokenizer(LUCENE_VERSION, reader);
>         TokenStream filter = new StopFilter(LUCENE_VERSION,
>                 new PorterStemFilter(new LowerCaseFilter(LUCENE_VERSION, source)),
>                 CharArraySet.copy(LUCENE_VERSION, stopWords));
>         return new TokenStreamComponents(source, filter);
>     }
> 
> }
> 
> Stack trace
> java.lang.IllegalStateException: TokenStream contract violation: close() call missing
>     at org.apache.lucene.analysis.Tokenizer.setReader(Tokenizer.java:89)
>     at org.apache.lucene.analysis.Analyzer$TokenStreamComponents.setReader(Analyzer.java:307)
>     at org.apache.lucene.analysis.Analyzer.tokenStream(Analyzer.java:145)
>     at LuceneStemmer.stem(LuceneStemmer.java:28)
>     at LuceneStemmerTest.stem(LuceneStemmerTest.java:16)
> 
> Thanks.
> 
> Regards,
> Joe
> 
> 
> On Thu, Mar 20, 2014 at 1:40 PM, Uwe Schindler <[email protected]> wrote:
> 
> > Hi Joe,
> >
> > in Lucene 4.6, the TokenStream/Tokenizer APIs got some additional
> > state machine checks to ensure that consumers and subclasses of those
> > abstract interfaces are implemented correctly. The checks are not easy
> > to understand, because they are written so that they don't affect
> > performance. If your test case consumes the Tokenizer/TokenStream
> > incorrectly (e.g. it fails to call reset() or setReader() at the
> > correct places), an IllegalStateException is thrown. The
> > ILLEGAL_STATE_READER is there to ensure that the consumer gets a
> > correct exception if it calls setReader() or reset() in the wrong
> > order (or multiple times).
> >
> > The checks in the base class are definitely OK. If you hit the
> > IllegalStateException, you have some problem in your implementation
> > of the Tokenizer/TokenStream interface (e.g. missing super() calls or
> > calling reset() from inside setReader() or whatever). Or the consumer
> > does not respect the full documented workflow:
> >
> > http://lucene.apache.org/core/4_6_1/core/org/apache/lucene/analysis/TokenStream.html
> >
> > If you have TokenFilters in your analysis chain, the source of error
> > may also be missing super delegations in reset(), end(),... If you
> > need further help, post your implementation of the consumer in your
> > test case or post your analysis chain and custom Tokenizers. You may
> > also post the stack trace in addition, because this helps to find out
> > what call sequence you have.
> >
> > Uwe
> >
> > -----
> > Uwe Schindler
> > H.-H.-Meier-Allee 63, D-28213 Bremen
> > http://www.thetaphi.de
> > eMail: [email protected]
> >
> > > -----Original Message-----
> > > From: Joe Wong [mailto:[email protected]]
> > > Sent: Thursday, March 20, 2014 8:58 PM
> > > To: [email protected]
> > > Subject: Possible issue with Tokenizer in
> > > lucene-analyzers-common-4.6.1
> > >
> > > Hi
> > >
> > > We're planning to upgrade lucene-analyzers-common from 4.3.0 to 4.6.1.
> > > While running our unit tests with 4.6.1, one fails at
> > > org.apache.lucene.analysis.Tokenizer on line 88 (setReader method).
> > > There it checks if input != ILLEGAL_STATE_READER and then throws
> > > IllegalStateException. Should it not be if input ==
> > > ILLEGAL_STATE_READER?
> > >
> > > Regards,
> > > Joe
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: [email protected]
> > For additional commands, e-mail: [email protected]
> >
> >

