Hi Michael,

Yes, that's the desired behavior. The setMaxTokenLength method is supposed
to allow that.
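
In case anyone instead wants the 4.7.x behavior of discarding too-long
terms rather than splitting them, one option is to wrap the tokenizer in
a LengthFilter. A minimal sketch against 4.10.4 (untested; it assumes
the three-argument LengthFilter constructor from
lucene-analyzers-common and the Reader-only StandardTokenizer
constructor available in recent 4.x releases):

import java.io.StringReader;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.miscellaneous.LengthFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class DiscardLongTerms {
    public static void main(String[] args) throws Exception {
        StandardTokenizer tokenizer =
                new StandardTokenizer(new StringReader("Tokenize me!"));
        // Leave the tokenizer's own limit at its 255-char default so
        // long terms survive tokenization intact, then drop any term
        // longer than 4 chars (or shorter than 1).
        TokenStream stream = new LengthFilter(tokenizer, 1, 4);
        stream.reset();
        CharTermAttribute attr = stream.addAttribute(CharTermAttribute.class);
        while (stream.incrementToken()) {
            System.out.println(attr.toString()); // prints only "me"
        }
        stream.end();
        stream.close();
    }
}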

Cheers,
Sattam


On Tue, Sep 13, 2016 at 11:57 AM, Michael McCandless <[email protected]> wrote:

> I guess this was a change in behavior in those versions.
>
> Are you wanting to discard the too-long terms (the 4.7.x behavior)?
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Tue, Sep 13, 2016 at 12:42 AM, Sattam Alsubaiee <[email protected]> wrote:
> > I'm trying to understand the tokenization behavior in Lucene. When using
> > the StandardTokenizer in Lucene version 4.7.1 to tokenize the string
> > "Tokenize me!" with the max token length set to 4, I get only the token
> > "me", but when using Lucene version 4.10.4, I get the tokens "Toke",
> > "nize", and "me".
> >
> > When debugging what's happening, I see that the scanner in version 4.10.4
> > reads only x characters at a time and then applies the tokenization, where
> > x is the max token length passed by the user, while in version 4.7.1 the
> > scanner fills its buffer irrespective of the max token length (it uses the
> > default buffer size to decide how many characters to read each time).
> >
> > This is the commit that made the change:
> > https://github.com/apache/lucene-solr/commit/33204ddd895a26a56c1edd92594800ef285f0d4a
> >
> > You can see in StandardTokenizer.java that this code was added and caused
> > this behavior:
> >
> > if (scanner instanceof StandardTokenizerImpl) {
> >     scanner.setBufferSize(Math.min(length, 1024 * 1024)); // limit buffer size to 1M chars
> > }
> >
> > I also see the same code in master.
> >
> > Thanks,
> > Sattam
> >
> > p.s. Here is the code to reproduce what I'm seeing.
> >
> > Version 4.7.1 (using the jar files from
> > http://archive.apache.org/dist/lucene/java/4.7.1/):
> >
> > import java.io.IOException;
> > import java.io.StringReader;
> >
> > import org.apache.lucene.analysis.standard.StandardTokenizer;
> > import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
> > import org.apache.lucene.util.AttributeSource.AttributeFactory;
> > import org.apache.lucene.util.Version;
> >
> > public class Test {
> >     public static void main(String[] args) throws IOException {
> >         AttributeFactory factory = AttributeFactory.DEFAULT_ATTRIBUTE_FACTORY;
> >         StandardTokenizer tokenizer = new StandardTokenizer(Version.LUCENE_47,
> >                 factory, new StringReader("Tokenize me!"));
> >         tokenizer.setMaxTokenLength(4);
> >         tokenizer.reset();
> >         CharTermAttribute attr = tokenizer.addAttribute(CharTermAttribute.class);
> >         while (tokenizer.incrementToken()) {
> >             System.out.println(attr.toString()); // prints only "me"
> >         }
> >     }
> > }
> >
> >
> > Version 4.10.4 (using the jar files from
> > http://archive.apache.org/dist/lucene/java/4.10.4/):
> >
> > import java.io.IOException;
> > import java.io.StringReader;
> >
> > import org.apache.lucene.analysis.standard.StandardTokenizer;
> > import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
> > import org.apache.lucene.util.AttributeFactory;
> >
> > public class Test {
> >     public static void main(String[] args) throws IOException {
> >         AttributeFactory factory = AttributeFactory.DEFAULT_ATTRIBUTE_FACTORY;
> >         StandardTokenizer tokenizer = new StandardTokenizer(factory,
> >                 new StringReader("Tokenize me!"));
> >         tokenizer.setMaxTokenLength(4);
> >         tokenizer.reset();
> >         CharTermAttribute attr = tokenizer.addAttribute(CharTermAttribute.class);
> >         while (tokenizer.incrementToken()) {
> >             System.out.println(attr.toString()); // prints "Toke", "nize", "me"
> >         }
> >     }
> > }
> >
> >
>
