Thanks, Steve.

Sattam
On Tue, Sep 13, 2016 at 5:51 PM, Steve Rowe <[email protected]> wrote:

> Hi Sattam,
>
> You’re right, StandardTokenizer's behavior changed (in 4.9.1/4.10) to
> split long tokens at maxTokenLength rather than ignore tokens longer than
> maxTokenLength.
>
> You can simulate the old behavior by setting maxTokenLength to the length
> of the longest token you want to be able to ignore, and then adding a
> LengthFilter to your analysis chain that eliminates too-long tokens.
>
> E.g.:
>
> import java.io.IOException;
> import java.io.Reader;
> import org.apache.lucene.analysis.Analyzer;
> import org.apache.lucene.analysis.TokenStream;
> import org.apache.lucene.analysis.standard.StandardTokenizer;
> import org.apache.lucene.analysis.miscellaneous.LengthFilter;
> import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
>
> public class Test {
>   public static void main(String[] args) throws IOException {
>     Analyzer analyzer = new Analyzer() {
>       @Override
>       protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
>         StandardTokenizer tokenizer = new StandardTokenizer(reader);
>         tokenizer.setMaxTokenLength(10); // Longest expected token: 10 chars
>         return new TokenStreamComponents(tokenizer, new LengthFilter(tokenizer, 1, 4));
>       }
>     };
>     TokenStream stream = analyzer.tokenStream("dummy", "Tokenize me!");
>     stream.reset();
>     CharTermAttribute termAttr = stream.addAttribute(CharTermAttribute.class);
>     while (stream.incrementToken()) {
>       String term = termAttr.toString();
>       System.out.println(term);
>     }
>     stream.end();
>     stream.close();
>   }
> }
>
> --
> Steve
> www.lucidworks.com
>
> > On Sep 13, 2016, at 7:59 PM, Sattam Alsubaiee <[email protected]> wrote:
> >
> > Hi Michael,
> >
> > Yes, that's the desired behavior. The setMaxTokenLength method is
> > supposed to allow that.
> >
> > Cheers,
> > Sattam
> >
> > On Tue, Sep 13, 2016 at 11:57 AM, Michael McCandless <[email protected]> wrote:
> > I guess this was a change in behavior in those versions.
> >
> > Are you wanting to discard the too-long terms (the 4.7.x behavior)?
> >
> > Mike McCandless
> >
> > http://blog.mikemccandless.com
> >
> > On Tue, Sep 13, 2016 at 12:42 AM, Sattam Alsubaiee <[email protected]> wrote:
> > > I'm trying to understand the tokenization behavior in Lucene. When using
> > > StandardTokenizer in Lucene version 4.7.1 to tokenize the string
> > > "Tokenize me!" with the max token length set to 4, I get only the token
> > > "me", but when using Lucene version 4.10.4, I get the tokens "Toke",
> > > "nize", and "me".
> > >
> > > When debugging what's happening, I see that in version 4.10.4 the scanner
> > > reads only x bytes at a time and then applies the tokenization, where x is
> > > the max token length passed by the user, while in version 4.7.1 the scanner
> > > fills the buffer irrespective of the max token length (it uses the default
> > > buffer size to decide how many bytes it reads each time).
> > >
> > > This is the commit that made the change:
> > > https://github.com/apache/lucene-solr/commit/33204ddd895a26a56c1edd92594800ef285f0d4a
> > >
> > > You can see in StandardTokenizer.java that this code was added and caused
> > > this behavior:
> > >
> > > if (scanner instanceof StandardTokenizerImpl) {
> > >   scanner.setBufferSize(Math.min(length, 1024 * 1024)); // limit buffer size to 1M chars
> > > }
> > >
> > > I also see the same code in master.
> > >
> > > Thanks,
> > > Sattam
> > >
> > > p.s. Here is the code to reproduce what I'm seeing.
> > >
> > > Version 4.7.1 (using the jar files here:
> > > http://archive.apache.org/dist/lucene/java/4.7.1/):
> > >
> > > import java.io.IOException;
> > > import java.io.StringReader;
> > > import org.apache.lucene.analysis.standard.StandardTokenizer;
> > > import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
> > > import org.apache.lucene.util.AttributeSource.AttributeFactory;
> > > import org.apache.lucene.util.Version;
> > >
> > > public class Test {
> > >   public static void main(String[] args) throws IOException {
> > >     AttributeFactory factory = AttributeFactory.DEFAULT_ATTRIBUTE_FACTORY;
> > >     StandardTokenizer tokenizer = new StandardTokenizer(Version.LUCENE_47, factory,
> > >         new StringReader("Tokenize me!"));
> > >     tokenizer.setMaxTokenLength(4);
> > >     tokenizer.reset();
> > >     CharTermAttribute attr = tokenizer.addAttribute(CharTermAttribute.class);
> > >     while (tokenizer.incrementToken()) {
> > >       String term = attr.toString();
> > >       System.out.println(term);
> > >     }
> > >   }
> > > }
> > >
> > > Version 4.10.4 (using the jar files here:
> > > http://archive.apache.org/dist/lucene/java/4.10.4/):
> > >
> > > import java.io.IOException;
> > > import java.io.StringReader;
> > > import org.apache.lucene.analysis.standard.StandardTokenizer;
> > > import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
> > > import org.apache.lucene.util.AttributeFactory;
> > >
> > > public class Test {
> > >   public static void main(String[] args) throws IOException {
> > >     AttributeFactory factory = AttributeFactory.DEFAULT_ATTRIBUTE_FACTORY;
> > >     StandardTokenizer tokenizer = new StandardTokenizer(factory,
> > >         new StringReader("Tokenize me!"));
> > >     tokenizer.setMaxTokenLength(4);
> > >     tokenizer.reset();
> > >     CharTermAttribute attr = tokenizer.addAttribute(CharTermAttribute.class);
> > >     while (tokenizer.incrementToken()) {
> > >       String term = attr.toString();
> > >       System.out.println(term);
> > >     }
> > >   }
> > > }
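For reference, below is a minimal, untested sketch of Steve's workaround applied to the reproduce case above, written against the 4.10.4-style constructors shown in this thread (LengthFilter lives in the lucene-analyzers-common jar). The value 255 for setMaxTokenLength is just an assumed "large enough" bound, not something prescribed in the thread; the idea is only that it exceeds the longest token you want LengthFilter to be able to drop. With it, the program should print only "me", matching the 4.7.1 output.

import java.io.IOException;
import java.io.StringReader;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.miscellaneous.LengthFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class Test {
  public static void main(String[] args) throws IOException {
    StandardTokenizer tokenizer = new StandardTokenizer(new StringReader("Tokenize me!"));
    // Keep maxTokenLength large so long tokens are NOT split by the tokenizer...
    tokenizer.setMaxTokenLength(255); // assumed bound; must exceed the longest token to ignore
    // ...and let LengthFilter drop anything longer than 4 chars, as in 4.7.1.
    TokenStream stream = new LengthFilter(tokenizer, 1, 4);
    stream.reset();
    CharTermAttribute attr = stream.addAttribute(CharTermAttribute.class);
    while (stream.incrementToken()) {
      System.out.println(attr.toString()); // expected output: "me"
    }
    stream.end();
    stream.close();
  }
}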
