Hi, I am glad that I was able to help you!

One more optimization for your consumer: CharTermAttribute implements CharSequence, so you can append it to the StringBuilder directly; there is no need to call toString() (see http://goo.gl/Ffg9tW):

    builder.append(termAttribute);

This saves the instantiation of a useless extra String object per token.
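In the incrementToken() loop from the consumer further down this thread, that is a one-line change (a sketch; builder and termAttribute as declared in the stem() method below):

    while (tokenStream.incrementToken()) {
        if (builder.length() > 0) {
            builder.append(' ');
        }
        // CharTermAttribute is a CharSequence, so this appends the term's
        // characters directly, without allocating an intermediate String:
        builder.append(termAttribute);  // was: builder.append(termAttribute.toString())
    }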
Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de


-----Original Message-----
From: Joe Wong [mailto:jw...@adacado.com]
Sent: Thursday, March 20, 2014 11:50 PM
To: java-user@lucene.apache.org
Subject: Re: Possible issue with Tokenizer in lucene-analyzers-common-4.6.1

Thanks Uwe. It worked.


On Thu, Mar 20, 2014 at 3:28 PM, Uwe Schindler <u...@thetaphi.de> wrote:

Hi,

the IllegalStateException tells you what's wrong: "TokenStream contract violation: close() call missing"

Analyzer internally reuses TokenStreams, so when you call Analyzer.tokenStream() a second time it returns the same instance of your TokenStream. On that second call the state machine detects that the stream was never closed, which is easy to see: your consumer never closes the TokenStream returned by the analyzer after it finishes the incrementToken() loop. This is why I said that you have to follow the official consuming workflow as described on the TokenStream API page. Use try...finally or the Java 7 try-with-resources statement to make sure the TokenStream is closed after use. This also closes the reader (which is not strictly needed for StringReaders, but TokenStream has to clean up other internal resources in close() - the state machine ensures this is done).

One additional tip: in Lucene 4.6+ it is no longer necessary to wrap the input in a StringReader to analyze Strings. There is a second method on Analyzer that takes the String to analyze directly (instead of a Reader) and uses an optimized workflow internally.

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de
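Putting that advice together (try-with-resources, the String-taking tokenStream() overload, and the direct CharSequence append from the top of the thread), a corrected stem() could look like the following sketch. It assumes the Lucene 4.6 API and the Stemmer interface from the code below; it has not been tested against the original project:

    @Override
    public String stem(String text) {
        StringBuilder builder = new StringBuilder();
        // try-with-resources guarantees close(), which resets the reused
        // Tokenizer's state machine for the next call. The String overload
        // of tokenStream() makes the StringReader wrapper unnecessary.
        try (TokenStream tokenStream = analyzer.tokenStream(null, text)) {
            CharTermAttribute termAttribute =
                tokenStream.addAttribute(CharTermAttribute.class);
            tokenStream.reset();
            while (tokenStream.incrementToken()) {
                if (builder.length() > 0) {
                    builder.append(' ');
                }
                builder.append(termAttribute); // CharSequence, no toString()
            }
            tokenStream.end();
        } catch (IOException e) {
            // shouldn't happen when analyzing an in-memory String
            throw new RuntimeException(e.getMessage(), e);
        }
        return builder.toString();
    }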
-----Original Message-----
From: Joe Wong [mailto:jw...@adacado.com]
Sent: Thursday, March 20, 2014 11:13 PM
To: java-user@lucene.apache.org
Subject: Re: Possible issue with Tokenizer in lucene-analyzers-common-4.6.1

Hi Uwe,

Thanks for the reply. I'm not familiar with Lucene usage, so any help would be appreciated.

In our test we execute several consecutive stemming operations (the exception is thrown when the second stemmer.stem() call is made). The code, see below, does call the reset() method, but as you say it may be called in the wrong place.

    @Test
    public void stem() {
        LuceneStemmer stemmer = new LuceneStemmer();
        assertEquals("thing", stemmer.stem("thing"));
        assertEquals("thing", stemmer.stem("things"));
        assertEquals("genius", stemmer.stem("geniuses"));
        assertEquals("fri", stemmer.stem("fries"));
        assertEquals("gentli", stemmer.stem("gently"));
    }

--- LuceneStemmer class ---

    import java.io.IOException;
    import java.io.StringReader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.en.PorterStemFilter;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public class LuceneStemmer implements Stemmer {

        /** Analyzer that tokenizes on whitespace, and lower-cases and stems words */
        private Analyzer analyzer = new StemmingAnalyzer();

        /**
         * Returns a version of the text with all words lower-cased and stemmed
         * @param text String to stem
         * @return stemmed text
         */
        @Override
        public String stem(String text) {
            StringBuilder builder = new StringBuilder();
            try {
                TokenStream tokenStream = analyzer.tokenStream(null, new StringReader(text));
                tokenStream.reset();

                CharTermAttribute termAttribute = tokenStream.getAttribute(CharTermAttribute.class);
                while (tokenStream.incrementToken()) {
                    if (builder.length() > 0) {
                        builder.append(' ');
                    }
                    builder.append(termAttribute.toString());
                }
            } catch (IOException e) {
                // shouldn't happen reading from a StringReader, but you never know
                throw new RuntimeException(e.getMessage(), e);
            }
            return builder.toString();
        }

    }

--- StemmingAnalyzer class ---

    import com.google.common.collect.Sets;
    import java.io.Reader;
    import java.util.Collections;
    import java.util.Set;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.core.LowerCaseFilter;
    import org.apache.lucene.analysis.core.StopFilter;
    import org.apache.lucene.analysis.core.WhitespaceTokenizer;
    import org.apache.lucene.analysis.en.PorterStemFilter;
    import org.apache.lucene.analysis.util.CharArraySet;

    import static com.adacado.Constants.*;

    public final class StemmingAnalyzer extends Analyzer {

        private Set<String> stopWords;

        public StemmingAnalyzer() {
            this.stopWords = Collections.EMPTY_SET;
        }

        public StemmingAnalyzer(Set<String> stopWords) {
            this.stopWords = stopWords;
        }

        public StemmingAnalyzer(String... stopWords) {
            this.stopWords = Sets.newHashSet(stopWords);
        }

        @Override
        protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
            Tokenizer source = new WhitespaceTokenizer(LUCENE_VERSION, reader);
            TokenStream filter = new StopFilter(LUCENE_VERSION,
                    new PorterStemFilter(new LowerCaseFilter(LUCENE_VERSION, source)),
                    CharArraySet.copy(LUCENE_VERSION, stopWords));
            return new TokenStreamComponents(source, filter);
        }

    }

Stack trace:

    java.lang.IllegalStateException: TokenStream contract violation: close() call missing
        at org.apache.lucene.analysis.Tokenizer.setReader(Tokenizer.java:89)
        at org.apache.lucene.analysis.Analyzer$TokenStreamComponents.setReader(Analyzer.java:307)
        at org.apache.lucene.analysis.Analyzer.tokenStream(Analyzer.java:145)
        at LuceneStemmer.stem(LuceneStemmer.java:28)
        at LuceneStemmerTest.stem(LuceneStemmerTest.java:16)

Thanks.

Regards,
Joe
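That stack trace reduces to the following call sequence (a sketch of the failure mode, not the test itself; IOException handling omitted):

    Analyzer analyzer = new StemmingAnalyzer();

    // First stem(): the Analyzer creates a TokenStream and caches it internally.
    TokenStream first = analyzer.tokenStream(null, new StringReader("things"));
    first.reset();
    while (first.incrementToken()) { /* consume */ }
    // first.close() is never called, so the cached Tokenizer's state
    // machine still considers it "in use".

    // Second stem(): the Analyzer hands back the SAME cached instance and
    // calls setReader() on it, which throws
    //   java.lang.IllegalStateException:
    //   TokenStream contract violation: close() call missing
    TokenStream second = analyzer.tokenStream(null, new StringReader("geniuses"));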
On Thu, Mar 20, 2014 at 1:40 PM, Uwe Schindler <u...@thetaphi.de> wrote:

Hi Joe,

in Lucene 4.6, the TokenStream/Tokenizer APIs got some additional state machine checks to ensure that consumers and subclasses of those abstract interfaces are implemented correctly. The checks are not easy to understand, because they are implemented in a way that does not affect performance. If your test case consumes the Tokenizer/TokenStream in a wrong way (e.g. missing calls to reset() or setReader() at the correct places), an IllegalStateException is thrown. The ILLEGAL_STATE_READER is there to ensure that the consumer gets a correct exception if it calls setReader() or reset() in the wrong order (or multiple times).

The checks in the base class are definitely OK. If you hit the IllegalStateException, you have a problem in your implementation of the Tokenizer/TokenStream interface (e.g. missing super() calls, or calling reset() from inside setReader(), or similar), or the consumer does not respect the fully documented workflow:
http://lucene.apache.org/core/4_6_1/core/org/apache/lucene/analysis/TokenStream.html

If you have TokenFilters in your analysis chain, the source of the error may also be missing super delegations in reset(), end(), ... If you need further help, post your implementation of the consumer in your test case, or post your analysis chain and custom Tokenizers. You may also post the stack trace in addition, because it helps to find out what call sequence you have.

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de
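The super-delegation point looks like this in practice (a sketch with a hypothetical filter; CountingFilter is not code from this thread):

    import java.io.IOException;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;

    // A TokenFilter that counts tokens. The overridden reset() MUST call
    // super.reset(), because TokenFilter delegates reset()/end()/close()
    // to the wrapped input stream; dropping the super call breaks the
    // chain's state machine and leads to exactly these IllegalStateExceptions.
    public final class CountingFilter extends TokenFilter {

        private int tokenCount;

        public CountingFilter(TokenStream input) {
            super(input);
        }

        @Override
        public boolean incrementToken() throws IOException {
            if (input.incrementToken()) {
                tokenCount++;
                return true;
            }
            return false;
        }

        @Override
        public void reset() throws IOException {
            super.reset();   // required: delegates to input.reset()
            tokenCount = 0;  // then reset this filter's own state
        }
    }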
-----Original Message-----
From: Joe Wong [mailto:jw...@adacado.com]
Sent: Thursday, March 20, 2014 8:58 PM
To: java-user@lucene.apache.org
Subject: Possible issue with Tokenizer in lucene-analyzers-common-4.6.1

Hi,

We're planning to upgrade lucene-analyzers-common from 4.3.0 to 4.6.1. While running our unit tests with 4.6.1, a test fails at org.apache.lucene.analysis.Tokenizer on line 88 (the setReader method). There it throws IllegalStateException if input != ILLEGAL_STATE_READER. Should the check not be input == ILLEGAL_STATE_READER?

Regards,
Joe
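(As the replies above explain, the != is intentional: close() is the call that resets the Tokenizer's input to ILLEGAL_STATE_READER, so finding any other reader in setReader() proves close() was skipped. A rough paraphrase of the 4.6.1 state machine, not the verbatim source:)

    // Sketch of the relevant Tokenizer state machine (paraphrased):
    public void close() throws IOException {
        input.close();
        // mark this Tokenizer as "between uses"
        inputPending = input = ILLEGAL_STATE_READER;
    }

    public final void setReader(Reader input) throws IOException {
        if (this.input != ILLEGAL_STATE_READER) {
            // still holding the previous reader => close() was never called
            throw new IllegalStateException(
                "TokenStream contract violation: close() call missing");
        }
        this.inputPending = input;
    }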