I guess this was a change in behavior in those versions. Are you wanting to discard the too-long terms (the 4.7.x behavior)?
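If so, one way to approximate that on 4.10.x is to leave maxTokenLength at its default (so the scanner buffer is never shrunk and long tokens are never split) and drop the over-long terms downstream with a LengthFilter. A minimal sketch, assuming the Version-less LengthFilter(TokenStream, int, int) constructor is available in your 4.10.4 jars (if not, the deprecated Version-taking constructor should behave the same):

    import java.io.IOException;
    import java.io.StringReader;

    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.miscellaneous.LengthFilter;
    import org.apache.lucene.analysis.standard.StandardTokenizer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.util.AttributeFactory;

    public class DiscardLongTerms {
        public static void main(String[] args) throws IOException {
            StandardTokenizer tokenizer = new StandardTokenizer(
                    AttributeFactory.DEFAULT_ATTRIBUTE_FACTORY,
                    new StringReader("Tokenize me!"));
            // Leave maxTokenLength at its default (255): the scanner buffer
            // stays large, so "Tokenize" is matched whole instead of being
            // split into 4-char pieces.
            TokenStream stream = new LengthFilter(tokenizer, 1, 4);
            CharTermAttribute attr = stream.addAttribute(CharTermAttribute.class);
            stream.reset();
            while (stream.incrementToken()) {
                System.out.println(attr.toString()); // prints just "me"
            }
            stream.end();
            stream.close();
        }
    }

Note that LengthFilter discards the over-long term entirely, which is the 4.7.x-style behavior you described, rather than splitting it.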
Mike McCandless

http://blog.mikemccandless.com

On Tue, Sep 13, 2016 at 12:42 AM, Sattam Alsubaiee <[email protected]> wrote:
> I'm trying to understand the tokenization behavior in Lucene. When using the
> StandardTokenizer in Lucene version 4.7.1 and tokenizing the string
> "Tokenize me!" with the max token length set to 4, I get only the token
> "me", but when using Lucene version 4.10.4, I get the tokens "Toke",
> "nize", and "me".
>
> When debugging what's happening, I see that the scanner in version 4.10.4
> reads only x bytes at a time and then applies the tokenization, where x is
> the max token length passed by the user, while in version 4.7.1 the scanner
> fills the buffer irrespective of the max token length (it uses the default
> buffer size to decide how many bytes to read each time).
>
> This is the commit that made the change:
> https://github.com/apache/lucene-solr/commit/33204ddd895a26a56c1edd92594800ef285f0d4a
>
> You can see in StandardTokenizer.java that this code was added and caused
> this behavior:
>
>     if (scanner instanceof StandardTokenizerImpl) {
>         scanner.setBufferSize(Math.min(length, 1024 * 1024)); // limit buffer size to 1M chars
>     }
>
> I also see the same code in master.
>
> Thanks,
> Sattam
>
> p.s. Here is the code to reproduce what I'm seeing.
>
> Version 4.7.1 (using the jar files here:
> http://archive.apache.org/dist/lucene/java/4.7.1/):
>
>     import java.io.IOException;
>     import java.io.StringReader;
>
>     import org.apache.lucene.analysis.standard.StandardTokenizer;
>     import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
>     import org.apache.lucene.util.AttributeSource.AttributeFactory;
>     import org.apache.lucene.util.Version;
>
>     public class Test {
>         public static void main(String[] args) throws IOException {
>             AttributeFactory factory = AttributeFactory.DEFAULT_ATTRIBUTE_FACTORY;
>             StandardTokenizer tokenizer = new StandardTokenizer(Version.LUCENE_47,
>                     factory, new StringReader("Tokenize me!"));
>             tokenizer.setMaxTokenLength(4);
>             tokenizer.reset();
>             CharTermAttribute attr = tokenizer.addAttribute(CharTermAttribute.class);
>             while (tokenizer.incrementToken()) {
>                 String term = attr.toString();
>                 System.out.println(term);
>             }
>         }
>     }
>
> Version 4.10.4 (using the jar files here:
> http://archive.apache.org/dist/lucene/java/4.10.4/):
>
>     import java.io.IOException;
>     import java.io.StringReader;
>
>     import org.apache.lucene.analysis.standard.StandardTokenizer;
>     import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
>     import org.apache.lucene.util.AttributeFactory;
>
>     public class Test {
>         public static void main(String[] args) throws IOException {
>             AttributeFactory factory = AttributeFactory.DEFAULT_ATTRIBUTE_FACTORY;
>             StandardTokenizer tokenizer = new StandardTokenizer(factory,
>                     new StringReader("Tokenize me!"));
>             tokenizer.setMaxTokenLength(4);
>             tokenizer.reset();
>             CharTermAttribute attr = tokenizer.addAttribute(CharTermAttribute.class);
>             while (tokenizer.incrementToken()) {
>                 String term = attr.toString();
>                 System.out.println(term);
>             }
>         }
>     }
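For anyone who lands on this thread later, here is a tiny standalone illustration of the two behaviors being discussed. This is toy code, not Lucene internals; the class name, method names, and the regex-splitting shortcut are all made up for the example:

    import java.util.ArrayList;
    import java.util.List;

    // Toy contrast of the two behaviors on "Tokenize me!" with a limit of 4:
    // 4.7.x-style discards over-long tokens outright, while 4.10.x-style
    // effectively splits them, because the scanner buffer is capped at
    // maxTokenLength and can never match a longer token in one piece.
    public class BehaviorDemo {

        // 4.7.x-style: a token longer than max is dropped entirely.
        static List<String> discardStyle(String text, int max) {
            List<String> out = new ArrayList<>();
            for (String t : text.split("\\W+")) {
                if (!t.isEmpty() && t.length() <= max) {
                    out.add(t);
                }
            }
            return out;
        }

        // 4.10.x-style: a token longer than max comes out in max-length chunks.
        static List<String> splitStyle(String text, int max) {
            List<String> out = new ArrayList<>();
            for (String t : text.split("\\W+")) {
                for (int i = 0; i < t.length(); i += max) {
                    out.add(t.substring(i, Math.min(i + max, t.length())));
                }
            }
            return out;
        }

        public static void main(String[] args) {
            System.out.println(discardStyle("Tokenize me!", 4)); // [me]
            System.out.println(splitStyle("Tokenize me!", 4));   // [Toke, nize, me]
        }
    }

Running it prints [me] for the discard-style behavior and [Toke, nize, me] for the split-style behavior, matching the outputs reported above for 4.7.1 and 4.10.4 respectively.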
