Thanks Uwe. It worked.
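
For the record, here is roughly what the fixed stem() method looks like now. This is just a sketch of what I ended up with: it follows the documented consuming workflow (reset() -> incrementToken() loop -> end() -> close()), uses the Java 7 try-with-resources statement so close() is always called, and switches to the String-taking Analyzer.tokenStream() you mentioned instead of wrapping the text in a StringReader.

    // Sketch of the corrected consumer: reset(), incrementToken() loop, end(),
    // with try-with-resources guaranteeing close() even if an exception is thrown.
    @Override
    public String stem(String text) {
        StringBuilder builder = new StringBuilder();
        // Lucene 4.6+: tokenStream(String, String) analyzes a String directly,
        // no StringReader needed.
        try (TokenStream tokenStream = analyzer.tokenStream(null, text)) {
            CharTermAttribute termAttribute =
                    tokenStream.getAttribute(CharTermAttribute.class);
            tokenStream.reset();
            while (tokenStream.incrementToken()) {
                if (builder.length() > 0) {
                    builder.append(' ');
                }
                builder.append(termAttribute.toString());
            }
            tokenStream.end();
        } catch (IOException e) {
            // shouldn't happen when analyzing an in-memory String, but just in case
            throw new RuntimeException(e.getMessage(), e);
        }
        return builder.toString();
    }
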
On Thu, Mar 20, 2014 at 3:28 PM, Uwe Schindler <u...@thetaphi.de> wrote:
> Hi,
>
> the IllegalStateException tells you what's wrong: "TokenStream contract
> violation: close() call missing"
>
> Analyzer internally reuses TokenStreams, so if you call
> Analyzer.tokenStream() a second time it will return the same instance of
> your TokenStream. On that second call the state machine detects that it
> was not closed before, which is easy to see: your consumer never closes
> the TokenStream returned by the analyzer after it finishes the
> incrementToken() loop. This is why I said that you have to follow the
> official consuming workflow as described on the TokenStream API page. Be
> sure to use try...finally or the Java 7 try-with-resources statement to
> make sure the TokenStream is closed after use. This will also close the
> reader (which is not needed for StringReaders, but TokenStream needs the
> additional cleanup of other internal resources when calling close() - the
> state machine ensures this is done).
>
> One additional tip: in Lucene 4.6+ it is no longer needed to pass a
> StringReader to analyze Strings. There is a second method in Analyzer
> that takes a String to analyze (instead of a Reader). This one uses an
> optimized workflow internally.
>
> Uwe
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: u...@thetaphi.de
>
>
> > -----Original Message-----
> > From: Joe Wong [mailto:jw...@adacado.com]
> > Sent: Thursday, March 20, 2014 11:13 PM
> > To: java-user@lucene.apache.org
> > Subject: Re: Possible issue with Tokenizer in lucene-analyzers-common-4.6.1
> >
> > Hi Uwe,
> >
> > Thanks for the reply. I'm not familiar with the usage of Lucene, so any
> > help would be appreciated.
> >
> > In our test we are executing several consecutive stemming operations
> > (the exception is thrown on the second stemmer.stem() call). The code
> > (see below) does call the reset() method, but like you say it could be
> > called at the wrong place.
> >
> > @Test
> > public void stem() {
> >     LuceneStemmer stemmer = new LuceneStemmer();
> >     assertEquals("thing", stemmer.stem("thing"));
> >     assertEquals("thing", stemmer.stem("things"));
> >     assertEquals("genius", stemmer.stem("geniuses"));
> >     assertEquals("fri", stemmer.stem("fries"));
> >     assertEquals("gentli", stemmer.stem("gently"));
> > }
> >
> > --- LuceneStemmer class ---
> > import java.io.IOException;
> > import java.io.StringReader;
> > import org.apache.lucene.analysis.Analyzer;
> > import org.apache.lucene.analysis.TokenStream;
> > import org.apache.lucene.analysis.en.PorterStemFilter;
> > import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
> >
> >
> > public class LuceneStemmer implements Stemmer {
> >
> >     /** Analyzer that tokenizes on whitespace, and lower-cases and stems words */
> >     private Analyzer analyzer = new StemmingAnalyzer();
> >
> >     /**
> >      * Returns version of text with all words lower-cased and stemmed
> >      * @param text String to stem
> >      * @return stemmed text
> >      */
> >     @Override
> >     public String stem(String text) {
> >         StringBuilder builder = new StringBuilder();
> >         try {
> >             TokenStream tokenStream = analyzer.tokenStream(null, new StringReader(text));
> >             tokenStream.reset();
> >
> >             CharTermAttribute termAttribute = tokenStream.getAttribute(CharTermAttribute.class);
> >             while (tokenStream.incrementToken()) {
> >                 if (builder.length() > 0) {
> >                     builder.append(' ');
> >                 }
> >                 builder.append(termAttribute.toString());
> >             }
> >         } catch (IOException e) {
> >             // shouldn't happen reading from a StringReader, but you never know
> >             throw new RuntimeException(e.getMessage(), e);
> >         }
> >         return builder.toString();
> >     }
> >
> > }
> >
> > --- StemmingAnalyzer class ----
> >
> > import com.google.common.collect.Sets;
> > import java.io.Reader;
> > import java.util.Collections;
> > import java.util.Set;
> > import org.apache.lucene.analysis.Analyzer;
> > import org.apache.lucene.analysis.TokenStream;
> > import org.apache.lucene.analysis.Tokenizer;
> > import org.apache.lucene.analysis.core.LowerCaseFilter;
> > import org.apache.lucene.analysis.core.StopFilter;
> > import org.apache.lucene.analysis.core.WhitespaceTokenizer;
> > import org.apache.lucene.analysis.en.PorterStemFilter;
> > import org.apache.lucene.analysis.util.CharArraySet;
> >
> > import static com.adacado.Constants.*;
> >
> > public final class StemmingAnalyzer extends Analyzer {
> >
> >     private Set<String> stopWords;
> >
> >     public StemmingAnalyzer() {
> >         this.stopWords = Collections.EMPTY_SET;
> >     }
> >
> >     public StemmingAnalyzer(Set<String> stopWords) {
> >         this.stopWords = stopWords;
> >     }
> >
> >     public StemmingAnalyzer(String... stopWords) {
> >         this.stopWords = Sets.newHashSet(stopWords);
> >     }
> >
> >     @Override
> >     protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
> >         Tokenizer source = new WhitespaceTokenizer(LUCENE_VERSION, reader);
> >         TokenStream filter = new StopFilter(LUCENE_VERSION,
> >                 new PorterStemFilter(new LowerCaseFilter(LUCENE_VERSION, source)),
> >                 CharArraySet.copy(LUCENE_VERSION, stopWords));
> >         return new TokenStreamComponents(source, filter);
> >     }
> >
> > }
> >
> > Stack trace:
> > java.lang.IllegalStateException: TokenStream contract violation: close() call missing
> >     at org.apache.lucene.analysis.Tokenizer.setReader(Tokenizer.java:89)
> >     at org.apache.lucene.analysis.Analyzer$TokenStreamComponents.setReader(Analyzer.java:307)
> >     at org.apache.lucene.analysis.Analyzer.tokenStream(Analyzer.java:145)
> >     at LuceneStemmer.stem(LuceneStemmer.java:28)
> >     at LuceneStemmerTest.stem(LuceneStemmerTest.java:16)
> >
> > Thanks.
> >
> > Regards,
> > Joe
> >
> >
> > On Thu, Mar 20, 2014 at 1:40 PM, Uwe Schindler <u...@thetaphi.de> wrote:
> > > Hi Joe,
> > >
> > > in Lucene 4.6, the TokenStream/Tokenizer APIs got some additional
> > > state machine checks to ensure that consumers and subclasses of those
> > > abstract interfaces are implemented correctly. The checks are not easy
> > > to understand, because they are implemented in a way that does not
> > > affect performance. If your test case consumes the Tokenizer/TokenStream
> > > in a wrong way (e.g. missing a call to reset() or setReader() at the
> > > correct places), an IllegalStateException is thrown. The
> > > ILLEGAL_STATE_READER is there to ensure that the consumer gets a
> > > correct exception if it calls setReader() or reset() in the wrong
> > > order (or multiple times).
> > >
> > > The checks in the base class are definitely OK. If you hit the
> > > IllegalStateException, you have some problem in your implementation
> > > of the Tokenizer/TokenStream interface (e.g. missing super() calls or
> > > calling reset() from inside setReader() or whatever), or the consumer
> > > does not respect the full documented workflow:
> > > http://lucene.apache.org/core/4_6_1/core/org/apache/lucene/analysis/TokenStream.html
> > >
> > > If you have TokenFilters in your analysis chain, the source of error
> > > may also be missing super delegations in reset(), end(), ... If you
> > > need further help, post your implementation of the consumer in your
> > > test case, or post your analysis chain and custom Tokenizers. You may
> > > also post the stack trace in addition, because this helps to find out
> > > what call sequence you have.
> > >
> > > Uwe
> > >
> > > -----
> > > Uwe Schindler
> > > H.-H.-Meier-Allee 63, D-28213 Bremen
> > > http://www.thetaphi.de
> > > eMail: u...@thetaphi.de
> > >
> > > > -----Original Message-----
> > > > From: Joe Wong [mailto:jw...@adacado.com]
> > > > Sent: Thursday, March 20, 2014 8:58 PM
> > > > To: java-user@lucene.apache.org
> > > > Subject: Possible issue with Tokenizer in lucene-analyzers-common-4.6.1
> > > >
> > > > Hi
> > > >
> > > > We're planning to upgrade lucene-analyzers-common from 4.3.0 to 4.6.1.
> > > > While running our unit test with 4.6.1 it fails at
> > > > org.apache.lucene.analysis.Tokenizer on line 88 (the setReader method).
> > > > There it checks if input != ILLEGAL_STATE_READER and then throws
> > > > IllegalStateException. Should it not be if input == ILLEGAL_STATE_READER?
> > > >
> > > > Regards,
> > > > Joe