The problem is that the bigTerm (a 16383-byte word) added to the doc isn't
returned as one token during indexing. StandardTokenizer splits it into a set
of tokens, each 256 bytes long. So the term isn't skipped as too long, but is
indexed as a set of tokens. The next tested term then isn't at position 3 but
at position 66, which triggers the assert.
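
Just to make the position shift concrete, here is a minimal standalone sketch
(plain arithmetic, not the CLucene API) of how chopping the 16383-byte term
into 256-byte pieces moves the following term. The assumption that the bigTerm
sits at position 2 (so the next term is expected at 3) is mine, inferred from
the positions above.

    #include <cstdio>

    int main() {
        const int bigTermLen = 16383; // length of the bigTerm in bytes
        const int chunkLen   = 256;   // token size currently produced by the split

        // Number of tokens the single bigTerm turns into: ceil(16383 / 256) = 64
        const int pieces = (bigTermLen + chunkLen - 1) / chunkLen;

        // Assumed layout: bigTerm at position 2, next term expected at 3.
        const int expectedNextPos = 3;
        const int actualNextPos   = (expectedNextPos - 1) + pieces; // 2 + 64 = 66

        std::printf("bigTerm becomes %d tokens; next term at %d instead of %d\n",
                    pieces, actualNextPos, expectedNextPos);
        return 0;
    }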

To fix this, StandardTokenizer would have to be modified to not split at the
LUCENE_MAX_WORD_LEN limit. That seems like quite a big change to me, and it
could break subsequent processing (e.g. TokenFilters) if some piece of code
assumes there is no token longer than LUCENE_MAX_WORD_LEN.
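
To illustrate the kind of breakage I mean, here is a purely hypothetical
filter sketch (the class name and constant value are made up, not taken from
the CLucene source) that silently relies on the current limit by copying term
text into a fixed-size buffer; a token longer than the limit would overrun it.

    #include <cstring>

    // Assumed value, matching the 256-byte chunks mentioned above; the real
    // constant in the CLucene source may differ.
    #define LUCENE_MAX_WORD_LEN 256

    // Hypothetical downstream filter step that assumes no token exceeds
    // LUCENE_MAX_WORD_LEN. If StandardTokenizer starts emitting longer
    // tokens, the strcpy below overruns the fixed buffer.
    struct ExampleTermFilter {
        char buffer[LUCENE_MAX_WORD_LEN + 1];

        void process(const char* termText) {
            std::strcpy(buffer, termText); // no length check: fine today,
                                           // an overrun with longer tokens
            // ... in-place transformation of buffer would follow here ...
        }
    };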

In my opinion this isn't important enough to fix immediately or in the near
future. Let's postpone it until we have a stable and tested 2_3_2. What do you
think?

Borek

