I guess this was a change in behavior in those versions. Are you wanting to discard the too-long terms (the 4.7.x behavior)?
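If so, one way to approximate that on 4.10.x is to leave maxTokenLength at its default (so the scanner buffer is never shrunk and long tokens are never split) and drop the over-long terms downstream with a LengthFilter. A minimal sketch, assuming the Version-less LengthFilter(TokenStream, int, int) constructor is available in your 4.10.4 jars (if not, the deprecated Version-taking constructor should behave the same):

    import java.io.IOException;
    import java.io.StringReader;

    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.miscellaneous.LengthFilter;
    import org.apache.lucene.analysis.standard.StandardTokenizer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.util.AttributeFactory;

    public class DiscardLongTerms {
        public static void main(String[] args) throws IOException {
            StandardTokenizer tokenizer = new StandardTokenizer(
                    AttributeFactory.DEFAULT_ATTRIBUTE_FACTORY,
                    new StringReader("Tokenize me!"));
            // Leave maxTokenLength at its default (255): the scanner buffer
            // stays large, so "Tokenize" is matched whole instead of being
            // split into 4-char pieces.
            TokenStream stream = new LengthFilter(tokenizer, 1, 4);
            CharTermAttribute attr = stream.addAttribute(CharTermAttribute.class);
            stream.reset();
            while (stream.incrementToken()) {
                System.out.println(attr.toString()); // prints just "me"
            }
            stream.end();
            stream.close();
        }
    }

Note that LengthFilter discards the over-long term entirely, which is the 4.7.x-style behavior you described, rather than splitting it.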
Mike McCandless

http://blog.mikemccandless.com

On Tue, Sep 13, 2016 at 12:42 AM, Sattam Alsubaiee <[email protected]> wrote:
> I'm trying to understand the tokenization behavior in Lucene. When using the
> StandardTokenizer in Lucene version 4.7.1 and tokenizing the string
> "Tokenize me!" with the max token length set to 4, I get only the token
> "me", but when using Lucene version 4.10.4, I get the tokens "Toke",
> "nize", and "me".
>
> When debugging what's happening, I see that the scanner in version 4.10.4
> reads only x bytes at a time and then applies the tokenization, where x is
> the max token length passed by the user, while in version 4.7.1 the scanner
> fills the buffer irrespective of the max token length (it uses the default
> buffer size to decide how many bytes to read each time).
>
> This is the commit that made the change:
> https://github.com/apache/lucene-solr/commit/33204ddd895a26a56c1edd92594800ef285f0d4a
>
> You can see in StandardTokenizer.java that this code was added and caused
> this behavior:
>
>     if (scanner instanceof StandardTokenizerImpl) {
>         scanner.setBufferSize(Math.min(length, 1024 * 1024)); // limit buffer size to 1M chars
>     }
>
> I also see the same code in master.
>
> Thanks,
> Sattam
>
> p.s. Here is the code to reproduce what I'm seeing.
>
> Version 4.7.1 (using the jar files here:
> http://archive.apache.org/dist/lucene/java/4.7.1/):
>
>     import java.io.IOException;
>     import java.io.StringReader;
>
>     import org.apache.lucene.analysis.standard.StandardTokenizer;
>     import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
>     import org.apache.lucene.util.AttributeSource.AttributeFactory;
>     import org.apache.lucene.util.Version;
>
>     public class Test {
>         public static void main(String[] args) throws IOException {
>             AttributeFactory factory = AttributeFactory.DEFAULT_ATTRIBUTE_FACTORY;
>             StandardTokenizer tokenizer = new StandardTokenizer(Version.LUCENE_47,
>                     factory, new StringReader("Tokenize me!"));
>             tokenizer.setMaxTokenLength(4);
>             tokenizer.reset();
>             CharTermAttribute attr = tokenizer.addAttribute(CharTermAttribute.class);
>             while (tokenizer.incrementToken()) {
>                 String term = attr.toString();
>                 System.out.println(term);
>             }
>         }
>     }
>
> Version 4.10.4 (using the jar files here:
> http://archive.apache.org/dist/lucene/java/4.10.4/):
>
>     import java.io.IOException;
>     import java.io.StringReader;
>
>     import org.apache.lucene.analysis.standard.StandardTokenizer;
>     import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
>     import org.apache.lucene.util.AttributeFactory;
>
>     public class Test {
>         public static void main(String[] args) throws IOException {
>             AttributeFactory factory = AttributeFactory.DEFAULT_ATTRIBUTE_FACTORY;
>             StandardTokenizer tokenizer = new StandardTokenizer(factory,
>                     new StringReader("Tokenize me!"));
>             tokenizer.setMaxTokenLength(4);
>             tokenizer.reset();
>             CharTermAttribute attr = tokenizer.addAttribute(CharTermAttribute.class);
>             while (tokenizer.incrementToken()) {
>                 String term = attr.toString();
>                 System.out.println(term);
>             }
>         }
>     }
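For anyone who lands on this thread later, here is a tiny standalone illustration of the two behaviors being discussed. This is toy code, not Lucene internals; the class name, method names, and the regex-splitting shortcut are all made up for the example:

    import java.util.ArrayList;
    import java.util.List;

    // Toy contrast of the two behaviors on "Tokenize me!" with a limit of 4:
    // 4.7.x-style discards over-long tokens outright, while 4.10.x-style
    // effectively splits them, because the scanner buffer is capped at
    // maxTokenLength and can never match a longer token in one piece.
    public class BehaviorDemo {

        // 4.7.x-style: a token longer than max is dropped entirely.
        static List<String> discardStyle(String text, int max) {
            List<String> out = new ArrayList<>();
            for (String t : text.split("\\W+")) {
                if (!t.isEmpty() && t.length() <= max) {
                    out.add(t);
                }
            }
            return out;
        }

        // 4.10.x-style: a token longer than max comes out in max-length chunks.
        static List<String> splitStyle(String text, int max) {
            List<String> out = new ArrayList<>();
            for (String t : text.split("\\W+")) {
                for (int i = 0; i < t.length(); i += max) {
                    out.add(t.substring(i, Math.min(i + max, t.length())));
                }
            }
            return out;
        }

        public static void main(String[] args) {
            System.out.println(discardStyle("Tokenize me!", 4)); // [me]
            System.out.println(splitStyle("Tokenize me!", 4));   // [Toke, nize, me]
        }
    }

Running it prints [me] for the discard-style behavior and [Toke, nize, me] for the split-style behavior, matching the outputs reported above for 4.7.1 and 4.10.4 respectively.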
