Hello Steve,

It is always a pleasure to help you develop such a great lib. Speaking of StandardTokenizer and setMaxTokenLength, I think I have found another problem. It looks like when a word is longer than the max token length, the analyzer adds two tokens: word.substring(0, maxLength) and word.substring(maxLength).
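The split is visible directly at the TokenStream level, before any index is involved. This is just a minimal sketch assuming the same Lucene version and APIs as my full test below (the class name is made up):

    import java.io.IOException;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public class ShowSplit {
        public static void main(String[] args) throws IOException {
            Analyzer analyzer = new StandardAnalyzer(); // default maxTokenLength is 255
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < 300; i++) {
                sb.append('a');
            }
            try (TokenStream ts = analyzer.tokenStream("f", sb.toString())) {
                CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
                ts.reset();
                while (ts.incrementToken()) {
                    // Prints the length of each emitted token.
                    System.out.println(term.length());
                }
                ts.end();
            }
        }
    }

For a single 300-char word this prints 255 and then 45, i.e. the oversized word comes out as two tokens. The full test below shows the same thing through an index.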
Look at this code (sorry, it is quite ugly):

    import java.io.IOException;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.Fields;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.MultiFields;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.Terms;
    import org.apache.lucene.index.TermsEnum;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.RAMDirectory;
    import org.apache.lucene.util.BytesRef;
    import org.apache.lucene.util.Version;

    public class TestMaxLength {
        public static void main(String[] args) throws IOException {
            String str = getString(300);
            IndexWriterConfig iwc = new IndexWriterConfig(Version.LATEST, new StandardAnalyzer());
            final RAMDirectory dir = new RAMDirectory();
            final IndexWriter writer = new IndexWriter(dir, iwc);

            // Index a single document whose only value is one 300-char word.
            Document doc = new Document();
            doc.add(new TextField("", str, Field.Store.NO));
            writer.addDocument(doc);

            IndexReader reader = DirectoryReader.open(writer, false);
            IndexSearcher indexSearcher = new IndexSearcher(reader);

            // The full 300-char term is not found...
            TopDocs td = indexSearcher.search(new TermQuery(new Term("", str)), 1);
            System.out.println("300*a: " + td.totalHits);
            // ...but its 255-char prefix and its 45-char remainder both are.
            td = indexSearcher.search(new TermQuery(new Term("", getString(255))), 1);
            System.out.println("255*a: " + td.totalHits);
            td = indexSearcher.search(new TermQuery(new Term("", getString(45))), 1);
            System.out.println("45*a: " + td.totalHits);

            // Dump every term in the index.
            System.out.println("\nTERMS");
            Fields fields = MultiFields.getFields(reader);
            for (String field : fields) {
                Terms terms = fields.terms(field);
                TermsEnum termsEnum = terms.iterator(null);
                BytesRef t;
                while ((t = termsEnum.next()) != null) {
                    final String keyword = t.utf8ToString();
                    System.out.println(keyword.length() + ": " + keyword);
                }
            }
        }

        public static final String getString(int n) {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < n; i++) {
                sb.append('a');
            }
            return sb.toString();
        }
    }

And here is the output:

    300*a: 0
    255*a: 1
    45*a: 1

    TERMS
    45: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
    255: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

Regards,
Piotr

On Fri, Jul 17, 2015 at 4:40 PM, Steve Rowe <sar...@gmail.com> wrote:

> Hi Piotr,
>
> Thanks for reporting!
>
> See https://issues.apache.org/jira/browse/LUCENE-6682
>
> Steve
> www.lucidworks.com
>
> > On Jul 16, 2015, at 4:47 AM, Piotr Idzikowski <piotridzikow...@gmail.com> wrote:
> >
> > Hello.
> > I am developing my own analyzer based on StandardAnalyzer.
> > I realized that tokenizer.setMaxTokenLength is called many times:
> >
> >     protected TokenStreamComponents createComponents(final String fieldName,
> >             final Reader reader) {
> >         final StandardTokenizer src = new StandardTokenizer(getVersion(), reader);
> >         src.setMaxTokenLength(maxTokenLength);
> >         TokenStream tok = new StandardFilter(getVersion(), src);
> >         tok = new LowerCaseFilter(getVersion(), tok);
> >         tok = new StopFilter(getVersion(), tok, stopwords);
> >         return new TokenStreamComponents(src, tok) {
> >             @Override
> >             protected void setReader(final Reader reader) throws IOException {
> >                 src.setMaxTokenLength(StandardAnalyzer.this.maxTokenLength);
> >                 super.setReader(reader);
> >             }
> >         };
> >     }
> >
> > Does it make sense if the length stays the same? I see it finally calls
> > this one (in StandardTokenizerImpl):
> >
> >     public final void setBufferSize(int numChars) {
> >         ZZ_BUFFERSIZE = numChars;
> >         char[] newZzBuffer = new char[ZZ_BUFFERSIZE];
> >         System.arraycopy(zzBuffer, 0, newZzBuffer, 0,
> >                 Math.min(zzBuffer.length, ZZ_BUFFERSIZE));
> >         zzBuffer = newZzBuffer;
> >     }
> >
> > So it just copies the old array's content into the new one.
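> >
> > If the size has not changed, the copy could presumably be skipped with a
> > simple guard. Just an untested sketch (zzRefill might also resize zzBuffer
> > independently of ZZ_BUFFERSIZE, so the exact condition would need checking):
> >
> >     public final void setBufferSize(int numChars) {
> >         // Skip the allocation and copy when the buffer already has the
> >         // requested size.
> >         if (numChars == ZZ_BUFFERSIZE && zzBuffer.length == ZZ_BUFFERSIZE) {
> >             return;
> >         }
> >         ZZ_BUFFERSIZE = numChars;
> >         char[] newZzBuffer = new char[ZZ_BUFFERSIZE];
> >         System.arraycopy(zzBuffer, 0, newZzBuffer, 0,
> >                 Math.min(zzBuffer.length, ZZ_BUFFERSIZE));
> >         zzBuffer = newZzBuffer;
> >     }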
> >
> > Regards
> > Piotr Idzikowski
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>