I'm trying to understand the tokenization behavior in Lucene. When using the StandardTokenizer in Lucene version 4.7.1 to tokenize the string "Tokenize me!" with the max token length set to 4, I get only the token "me"; but when using Lucene version 4.10.4, I get the tokens "Toke", "nize", and "me".
When debugging what's happening, I see that in version 4.10.4 the scanner reads only x characters at a time and then applies the tokenization, where x is the max token length passed by the user. In version 4.7.1, by contrast, the scanner fills the buffer irrespective of the max token length (it uses the default buffer size to decide how many characters to read each time). The observable consequence is that 4.7.1 silently drops a token longer than the max token length ("Tokenize" disappears), while 4.10.4 chops it at the buffer boundary into max-token-length pieces ("Toke" and "nize").

This is the commit that made the change:
https://github.com/apache/lucene-solr/commit/33204ddd895a26a56c1edd92594800ef285f0d4a

You can see in StandardTokenizer.java that this code was added and causes the new behavior:

if (scanner instanceof StandardTokenizerImpl) {
  scanner.setBufferSize(Math.min(length, 1024 * 1024)); // limit buffer size to 1M chars
}

I also see the same code in master.

Thanks,
Sattam

p.s. Here is the code to reproduce what I'm seeing.

Version 4.7.1 (using the jar files from http://archive.apache.org/dist/lucene/java/4.7.1/):

import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.AttributeSource.AttributeFactory;
import org.apache.lucene.util.Version;

public class Test {
    public static void main(String[] args) throws IOException {
        AttributeFactory factory = AttributeFactory.DEFAULT_ATTRIBUTE_FACTORY;
        StandardTokenizer tokenizer = new StandardTokenizer(Version.LUCENE_47, factory,
                new StringReader("Tokenize me!"));
        tokenizer.setMaxTokenLength(4);
        tokenizer.reset();
        CharTermAttribute attr = tokenizer.addAttribute(CharTermAttribute.class);
        while (tokenizer.incrementToken()) {
            String term = attr.toString();
            System.out.println(term);
        }
        tokenizer.end();   // release resources
        tokenizer.close();
    }
}

Version 4.10.4 (using the jar files from http://archive.apache.org/dist/lucene/java/4.10.4/):

import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.AttributeFactory;

public class Test {
    public static void main(String[] args) throws IOException {
        AttributeFactory factory = AttributeFactory.DEFAULT_ATTRIBUTE_FACTORY;
        StandardTokenizer tokenizer = new StandardTokenizer(factory,
                new StringReader("Tokenize me!"));
        tokenizer.setMaxTokenLength(4);
        tokenizer.reset();
        CharTermAttribute attr = tokenizer.addAttribute(CharTermAttribute.class);
        while (tokenizer.incrementToken()) {
            String term = attr.toString();
            System.out.println(term);
        }
        tokenizer.end();   // release resources
        tokenizer.close();
    }
}
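For reference, this is the output I see from each program for the input "Tokenize me!" with max token length 4:

4.7.1:

me

4.10.4:

Toke
nize
me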
