Hi Sattam,
You’re right, StandardTokenizer's behavior changed (in 4.9.1/4.10) to split
long tokens at maxTokenLength rather than ignore tokens longer than
maxTokenLength.
You can simulate the old behavior by setting maxTokenLength to the length of
the longest token you want to be able to discard (so it isn't split), and then
adding a LengthFilter to your analysis chain that eliminates tokens over your
real length limit.
E.g.:
import java.io.IOException;
import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.miscellaneous.LengthFilter;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class Test {
  public static void main(String[] args) throws IOException {
    Analyzer analyzer = new Analyzer() {
      @Override
      protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        StandardTokenizer tokenizer = new StandardTokenizer(reader);
        tokenizer.setMaxTokenLength(10); // Longest expected token: 10 chars
        return new TokenStreamComponents(tokenizer, new LengthFilter(tokenizer, 1, 4));
      }
    };
    TokenStream stream = analyzer.tokenStream("dummy", "Tokenize me!");
    stream.reset();
    CharTermAttribute termAttr = stream.addAttribute(CharTermAttribute.class);
    while (stream.incrementToken()) {
      String term = termAttr.toString();
      System.out.println(term);
    }
    stream.end();
    stream.close();
  }
}
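With that chain, the tokenizer still emits "Tokenize" whole (it's under the
10-char limit), but the LengthFilter (min 1, max 4) then discards it, so the
program should print only "me", matching the 4.7.1 output you described.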
--
Steve
www.lucidworks.com
> On Sep 13, 2016, at 7:59 PM, Sattam Alsubaiee <[email protected]> wrote:
>
> Hi Michael,
>
> Yes, that's the desired behavior. The setMaxTokenLength method is supposed to
> allow that.
>
> Cheers,
> Sattam
>
>
> On Tue, Sep 13, 2016 at 11:57 AM, Michael McCandless
> <[email protected]> wrote:
> I guess this was a change in behavior in those versions.
>
> Are you wanting to discard the too-long terms (the 4.7.x behavior)?
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Tue, Sep 13, 2016 at 12:42 AM, Sattam Alsubaiee <[email protected]>
> wrote:
> > I'm trying to understand the tokenization behavior in Lucene. When using the
> > StandardTokenizer in Lucene version 4.7.1 and trying to tokenize the string
> > "Tokenize me!" with the max token length set to 4, I get only the token "me",
> > but when using Lucene version 4.10.4, I get the tokens "Toke", "nize", and
> > "me".
> >
> > When debugging what's happening, I see that the scanner in version 4.10.4
> > reads only x bytes at a time and then applies the tokenization, where x is
> > the max token length passed by the user, while in version 4.7.1 the scanner
> > fills the buffer irrespective of the max token length (it uses the default
> > buffer size to decide how many bytes to read each time).
> >
> > This is the commit that made the change:
> > https://github.com/apache/lucene-solr/commit/33204ddd895a26a56c1edd92594800ef285f0d4a
> >
> > You can see in StandardTokenizer.java that this code was added and caused
> > this behavior:
> > if (scanner instanceof StandardTokenizerImpl) {
> >   scanner.setBufferSize(Math.min(length, 1024 * 1024)); // limit buffer size to 1M chars
> > }
> >
> > I also see the same code in master.
> >
> > Thanks,
> > Sattam
> >
> > p.s. Here is the code to reproduce what I'm seeing.
> > version 4.7.1 (using the jar files here
> > http://archive.apache.org/dist/lucene/java/4.7.1/)
> >
> >
> > import java.io.IOException;
> > import java.io.StringReader;
> > import org.apache.lucene.analysis.standard.StandardTokenizer;
> > import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
> > import org.apache.lucene.util.AttributeSource.AttributeFactory;
> > import org.apache.lucene.util.Version;
> >
> > public class Test {
> >   public static void main(String[] args) throws IOException {
> >     AttributeFactory factory = AttributeFactory.DEFAULT_ATTRIBUTE_FACTORY;
> >     StandardTokenizer tokenizer = new StandardTokenizer(Version.LUCENE_47, factory,
> >         new StringReader("Tokenize me!"));
> >     tokenizer.setMaxTokenLength(4);
> >     tokenizer.reset();
> >     CharTermAttribute attr = tokenizer.addAttribute(CharTermAttribute.class);
> >     while (tokenizer.incrementToken()) {
> >       String term = attr.toString();
> >       System.out.println(term);
> >     }
> >   }
> > }
> >
> >
> > version 4.10.4 (using the jar files here
> > http://archive.apache.org/dist/lucene/java/4.10.4/)
> >
> > import java.io.IOException;
> > import java.io.StringReader;
> > import org.apache.lucene.analysis.standard.StandardTokenizer;
> > import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
> > import org.apache.lucene.util.AttributeFactory;
> >
> > public class Test {
> >   public static void main(String[] args) throws IOException {
> >     AttributeFactory factory = AttributeFactory.DEFAULT_ATTRIBUTE_FACTORY;
> >     StandardTokenizer tokenizer = new StandardTokenizer(factory,
> >         new StringReader("Tokenize me!"));
> >     tokenizer.setMaxTokenLength(4);
> >     tokenizer.reset();
> >     CharTermAttribute attr = tokenizer.addAttribute(CharTermAttribute.class);
> >     while (tokenizer.incrementToken()) {
> >       String term = attr.toString();
> >       System.out.println(term);
> >     }
> >   }
> > }
> >
> >
>
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]