I'm trying to understand the tokenization behavior in Lucene. When using the StandardTokenizer in Lucene version 4.7.1 to tokenize the string "Tokenize me!" with the max token length set to 4, I get only the token "me"; but when using Lucene version 4.10.4, I get the tokens "Toke", "nize", and "me".
When debugging what's happening, I see that in version 4.10.4 the scanner reads only x characters at a time and then applies the tokenization, where x is the max token length passed by the user. In version 4.7.1, by contrast, the scanner fills the buffer irrespective of the max token length (it uses the default buffer size to decide how many characters to read each time). The observable consequence is that 4.7.1 silently drops a token longer than the max token length ("Tokenize" disappears), while 4.10.4 chops it at the buffer boundary into max-token-length pieces ("Toke" and "nize").

This is the commit that made the change:
https://github.com/apache/lucene-solr/commit/33204ddd895a26a56c1edd92594800ef285f0d4a

You can see in StandardTokenizer.java that this code was added and causes the new behavior:

if (scanner instanceof StandardTokenizerImpl) {
  scanner.setBufferSize(Math.min(length, 1024 * 1024)); // limit buffer size to 1M chars
}

I also see the same code in master.

Thanks,
Sattam

p.s. Here is the code to reproduce what I'm seeing.

Version 4.7.1 (using the jar files from http://archive.apache.org/dist/lucene/java/4.7.1/):

import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.AttributeSource.AttributeFactory;
import org.apache.lucene.util.Version;

public class Test {
    public static void main(String[] args) throws IOException {
        AttributeFactory factory = AttributeFactory.DEFAULT_ATTRIBUTE_FACTORY;
        StandardTokenizer tokenizer = new StandardTokenizer(Version.LUCENE_47, factory,
                new StringReader("Tokenize me!"));
        tokenizer.setMaxTokenLength(4);
        tokenizer.reset();
        CharTermAttribute attr = tokenizer.addAttribute(CharTermAttribute.class);
        while (tokenizer.incrementToken()) {
            String term = attr.toString();
            System.out.println(term);
        }
        tokenizer.end();   // release resources
        tokenizer.close();
    }
}

Version 4.10.4 (using the jar files from http://archive.apache.org/dist/lucene/java/4.10.4/):

import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.AttributeFactory;

public class Test {
    public static void main(String[] args) throws IOException {
        AttributeFactory factory = AttributeFactory.DEFAULT_ATTRIBUTE_FACTORY;
        StandardTokenizer tokenizer = new StandardTokenizer(factory,
                new StringReader("Tokenize me!"));
        tokenizer.setMaxTokenLength(4);
        tokenizer.reset();
        CharTermAttribute attr = tokenizer.addAttribute(CharTermAttribute.class);
        while (tokenizer.incrementToken()) {
            String term = attr.toString();
            System.out.println(term);
        }
        tokenizer.end();   // release resources
        tokenizer.close();
    }
}
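For reference, this is the output I see from each program for the input "Tokenize me!" with max token length 4:

4.7.1:

me

4.10.4:

Toke
nize
me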
