Thanks, Steve.

Sattam
On Tue, Sep 13, 2016 at 5:51 PM, Steve Rowe <[email protected]> wrote:

> Hi Sattam,
>
> You’re right, StandardTokenizer's behavior changed (in 4.9.1/4.10) to
> split long tokens at maxTokenLength rather than ignore tokens longer than
> maxTokenLength.
>
> You can simulate the old behavior by setting maxTokenLength to the length
> of the longest token you want to be able to ignore, and then adding a
> LengthFilter to your analysis chain that eliminates too-long tokens.
>
> E.g.:
>
> import java.io.IOException;
> import java.io.Reader;
> import org.apache.lucene.analysis.Analyzer;
> import org.apache.lucene.analysis.TokenStream;
> import org.apache.lucene.analysis.standard.StandardTokenizer;
> import org.apache.lucene.analysis.miscellaneous.LengthFilter;
> import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
>
> public class Test {
>   public static void main(String[] args) throws IOException {
>     Analyzer analyzer = new Analyzer() {
>       @Override
>       protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
>         StandardTokenizer tokenizer = new StandardTokenizer(reader);
>         tokenizer.setMaxTokenLength(10); // Longest expected token: 10 chars
>         return new TokenStreamComponents(tokenizer, new LengthFilter(tokenizer, 1, 4));
>       }
>     };
>     TokenStream stream = analyzer.tokenStream("dummy", "Tokenize me!");
>     stream.reset();
>     CharTermAttribute termAttr = stream.addAttribute(CharTermAttribute.class);
>     while (stream.incrementToken()) {
>       String term = termAttr.toString();
>       System.out.println(term);
>     }
>     stream.end();
>     stream.close();
>   }
> }
>
> --
> Steve
> www.lucidworks.com
>
> > On Sep 13, 2016, at 7:59 PM, Sattam Alsubaiee <[email protected]> wrote:
> >
> > Hi Michael,
> >
> > Yes, that's the desired behavior. The setMaxTokenLength method is
> > supposed to allow that.
> >
> > Cheers,
> > Sattam
> >
> > On Tue, Sep 13, 2016 at 11:57 AM, Michael McCandless <[email protected]> wrote:
> > I guess this was a change in behavior in those versions.
> >
> > Are you wanting to discard the too-long terms (the 4.7.x behavior)?
> >
> > Mike McCandless
> >
> > http://blog.mikemccandless.com
> >
> > On Tue, Sep 13, 2016 at 12:42 AM, Sattam Alsubaiee <[email protected]> wrote:
> > > I'm trying to understand the tokenization behavior in Lucene. When using
> > > StandardTokenizer in Lucene version 4.7.1 to tokenize the string
> > > "Tokenize me!" with the max token length set to 4, I get only the token
> > > "me", but when using Lucene version 4.10.4, I get the tokens "Toke",
> > > "nize", and "me".
> > >
> > > When debugging what's happening, I see that in version 4.10.4 the scanner
> > > reads only x bytes at a time and then applies the tokenization, where x is
> > > the max token length passed by the user, while in version 4.7.1 the scanner
> > > fills the buffer irrespective of the max token length (it uses the default
> > > buffer size to decide how many bytes it reads each time).
> > >
> > > This is the commit that made the change:
> > > https://github.com/apache/lucene-solr/commit/33204ddd895a26a56c1edd92594800ef285f0d4a
> > >
> > > You can see in StandardTokenizer.java that this code was added and caused
> > > this behavior:
> > >
> > > if (scanner instanceof StandardTokenizerImpl) {
> > >   scanner.setBufferSize(Math.min(length, 1024 * 1024)); // limit buffer size to 1M chars
> > > }
> > >
> > > I also see the same code in master.
> > >
> > > Thanks,
> > > Sattam
> > >
> > > p.s. Here is the code to reproduce what I'm seeing.
> > >
> > > Version 4.7.1 (using the jar files here:
> > > http://archive.apache.org/dist/lucene/java/4.7.1/):
> > >
> > > import java.io.IOException;
> > > import java.io.StringReader;
> > > import org.apache.lucene.analysis.standard.StandardTokenizer;
> > > import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
> > > import org.apache.lucene.util.AttributeSource.AttributeFactory;
> > > import org.apache.lucene.util.Version;
> > >
> > > public class Test {
> > >   public static void main(String[] args) throws IOException {
> > >     AttributeFactory factory = AttributeFactory.DEFAULT_ATTRIBUTE_FACTORY;
> > >     StandardTokenizer tokenizer = new StandardTokenizer(Version.LUCENE_47, factory,
> > >         new StringReader("Tokenize me!"));
> > >     tokenizer.setMaxTokenLength(4);
> > >     tokenizer.reset();
> > >     CharTermAttribute attr = tokenizer.addAttribute(CharTermAttribute.class);
> > >     while (tokenizer.incrementToken()) {
> > >       String term = attr.toString();
> > >       System.out.println(term);
> > >     }
> > >   }
> > > }
> > >
> > > Version 4.10.4 (using the jar files here:
> > > http://archive.apache.org/dist/lucene/java/4.10.4/):
> > >
> > > import java.io.IOException;
> > > import java.io.StringReader;
> > > import org.apache.lucene.analysis.standard.StandardTokenizer;
> > > import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
> > > import org.apache.lucene.util.AttributeFactory;
> > >
> > > public class Test {
> > >   public static void main(String[] args) throws IOException {
> > >     AttributeFactory factory = AttributeFactory.DEFAULT_ATTRIBUTE_FACTORY;
> > >     StandardTokenizer tokenizer = new StandardTokenizer(factory,
> > >         new StringReader("Tokenize me!"));
> > >     tokenizer.setMaxTokenLength(4);
> > >     tokenizer.reset();
> > >     CharTermAttribute attr = tokenizer.addAttribute(CharTermAttribute.class);
> > >     while (tokenizer.incrementToken()) {
> > >       String term = attr.toString();
> > >       System.out.println(term);
> > >     }
> > >   }
> > > }
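For reference, below is a minimal, untested sketch of Steve's workaround applied to the reproduce case above, written against the 4.10.4-style constructors shown in this thread (LengthFilter lives in the lucene-analyzers-common jar). The value 255 for setMaxTokenLength is just an assumed "large enough" bound, not something prescribed in the thread; the idea is only that it exceeds the longest token you want LengthFilter to be able to drop. With it, the program should print only "me", matching the 4.7.1 output.

import java.io.IOException;
import java.io.StringReader;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.miscellaneous.LengthFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class Test {
  public static void main(String[] args) throws IOException {
    StandardTokenizer tokenizer = new StandardTokenizer(new StringReader("Tokenize me!"));
    // Keep maxTokenLength large so long tokens are NOT split by the tokenizer...
    tokenizer.setMaxTokenLength(255); // assumed bound; must exceed the longest token to ignore
    // ...and let LengthFilter drop anything longer than 4 chars, as in 4.7.1.
    TokenStream stream = new LengthFilter(tokenizer, 1, 4);
    stream.reset();
    CharTermAttribute attr = stream.addAttribute(CharTermAttribute.class);
    while (stream.incrementToken()) {
      System.out.println(attr.toString()); // expected output: "me"
    }
    stream.end();
    stream.close();
  }
}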
