Hi Uwe,

Thanks for the reply. I'm not very familiar with how Lucene is meant to be used, so any help would be appreciated.
In our test we execute several consecutive stemming operations (the exception is thrown the second time stemmer.stem() is called). The code below does call reset(), but as you say it may be calling it in the wrong place.

@Test
public void stem() {
    LuceneStemmer stemmer = new LuceneStemmer();
    assertEquals("thing", stemmer.stem("thing"));
    assertEquals("thing", stemmer.stem("things"));
    assertEquals("genius", stemmer.stem("geniuses"));
    assertEquals("fri", stemmer.stem("fries"));
    assertEquals("gentli", stemmer.stem("gently"));
}

--- LuceneStemmer class ---

import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.en.PorterStemFilter;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class LuceneStemmer implements Stemmer {

    /** Analyzer that tokenizes on whitespace, and lower-cases and stems words */
    private Analyzer analyzer = new StemmingAnalyzer();

    /**
     * Returns version of text with all words lower-cased and stemmed
     * @param text String to stem
     * @return stemmed text
     */
    @Override
    public String stem(String text) {
        StringBuilder builder = new StringBuilder();
        try {
            TokenStream tokenStream = analyzer.tokenStream(null, new StringReader(text));
            tokenStream.reset();
            CharTermAttribute termAttribute = tokenStream.getAttribute(CharTermAttribute.class);
            while (tokenStream.incrementToken()) {
                if (builder.length() > 0) {
                    builder.append(' ');
                }
                builder.append(termAttribute.toString());
            }
        } catch (IOException e) {
            // shouldn't happen reading from a StringReader, but you never know
            throw new RuntimeException(e.getMessage(), e);
        }
        return builder.toString();
    }
}

--- StemmingAnalyzer class ---

import com.google.common.collect.Sets;

import java.io.Reader;
import java.util.Collections;
import java.util.Set;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.en.PorterStemFilter;
import org.apache.lucene.analysis.util.CharArraySet;

import static com.adacado.Constants.*;

public final class StemmingAnalyzer extends Analyzer {

    private Set<String> stopWords;

    public StemmingAnalyzer() {
        this.stopWords = Collections.EMPTY_SET;
    }

    public StemmingAnalyzer(Set<String> stopWords) {
        this.stopWords = stopWords;
    }

    public StemmingAnalyzer(String... stopWords) {
        this.stopWords = Sets.newHashSet(stopWords);
    }

    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        Tokenizer source = new WhitespaceTokenizer(LUCENE_VERSION, reader);
        TokenStream filter = new StopFilter(LUCENE_VERSION,
                new PorterStemFilter(new LowerCaseFilter(LUCENE_VERSION, source)),
                CharArraySet.copy(LUCENE_VERSION, stopWords));
        return new TokenStreamComponents(source, filter);
    }
}

--- Stack trace ---

java.lang.IllegalStateException: TokenStream contract violation: close() call missing
    at org.apache.lucene.analysis.Tokenizer.setReader(Tokenizer.java:89)
    at org.apache.lucene.analysis.Analyzer$TokenStreamComponents.setReader(Analyzer.java:307)
    at org.apache.lucene.analysis.Analyzer.tokenStream(Analyzer.java:145)
    at LuceneStemmer.stem(LuceneStemmer.java:28)
    at LuceneStemmerTest.stem(LuceneStemmerTest.java:16)
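Looking at the "close() call missing" message together with the workflow page you linked, my guess is that stem() also needs to call end() and close() after the incrementToken() loop, so that the reused Tokenizer will accept a new Reader on the next call. Here is roughly what I think the corrected method would look like (just my reading of the javadoc, untested so far; everything else is unchanged from the class above):

    // Guess at a corrected stem(), following the documented consume cycle:
    // reset() -> incrementToken() loop -> end() -> close().
    @Override
    public String stem(String text) {
        StringBuilder builder = new StringBuilder();
        try {
            TokenStream tokenStream = analyzer.tokenStream(null, new StringReader(text));
            CharTermAttribute termAttribute = tokenStream.getAttribute(CharTermAttribute.class);
            try {
                tokenStream.reset();               // must precede the first incrementToken()
                while (tokenStream.incrementToken()) {
                    if (builder.length() > 0) {
                        builder.append(' ');
                    }
                    builder.append(termAttribute.toString());
                }
                tokenStream.end();                 // signal that the stream is fully consumed
            } finally {
                tokenStream.close();               // lets the Analyzer reuse its Tokenizer on the next call
            }
        } catch (IOException e) {
            // shouldn't happen reading from a StringReader, but you never know
            throw new RuntimeException(e.getMessage(), e);
        }
        return builder.toString();
    }

Does that look like the right consume cycle, or is there more to it?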
Thanks.

Regards,
Joe


On Thu, Mar 20, 2014 at 1:40 PM, Uwe Schindler <u...@thetaphi.de> wrote:

> Hi Joe,
>
> in Lucene 4.6, the TokenStream/Tokenizer APIs got some additional state
> machine checks to ensure that consumers and subclasses of those abstract
> interfaces are implemented in a correct way - they are not easy to
> understand, because they are implemented in that way to ensure they don't
> affect performance. If your test case consumes the Tokenizer/TokenStream
> in the wrong way (e.g. missing to call reset() or setReader() at the
> correct places), an IllegalStateException is thrown. The
> ILLEGAL_STATE_READER is there to ensure that the consumer gets a correct
> exception if it calls setReader() or reset() in the wrong order (or
> multiple times).
>
> The checks in the base class are definitely OK; if you hit the
> IllegalStateException, you have some problems in your implementation of
> the Tokenizer/TokenStream interface (e.g. missing super() calls or calling
> reset() from inside setReader() or whatever). Or the consumer does not
> respect the full documented workflow:
> http://lucene.apache.org/core/4_6_1/core/org/apache/lucene/analysis/TokenStream.html
>
> If you have TokenFilters in your analysis chain, the source of error may
> also be missing super delegations in reset(), end(), ... If you need
> further help, post your implementation of the consumer in your test case,
> or post your analysis chain and custom Tokenizers. You may also post the
> stack trace in addition, because this helps to find out what call sequence
> you have.
>
> Uwe
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: u...@thetaphi.de
>
> > -----Original Message-----
> > From: Joe Wong [mailto:jw...@adacado.com]
> > Sent: Thursday, March 20, 2014 8:58 PM
> > To: java-user@lucene.apache.org
> > Subject: Possible issue with Tokenizer in lucene-analyzers-common-4.6.1
> >
> > Hi
> >
> > We're planning to upgrade lucene-analyzers-common 4.3.0 to 4.6.1. While
> > running our unit test with 4.6.1 it fails at
> > org.apache.lucene.analysis.Tokenizer on line 88 (the setReader method).
> > There it checks if input != ILLEGAL_STATE_READER and then throws
> > IllegalStateException. Should it not be if input == ILLEGAL_STATE_READER?
> >
> > Regards,
> > Joe
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org