The problem is that the bigTerm (a 16383-byte word) added to the doc isn't returned as one token during indexing. StandardTokenizer splits it into a series of tokens, each 256 bytes long. So the term isn't skipped as too long; it is indexed as a set of tokens. The next tested term then of course isn't at position 3 but at position 66 (splitting 16383 bytes at 256 bytes gives 64 tokens, so everything after the big word shifts by 63 positions), which triggers the assert.
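For the record, here is a tiny standalone program that reproduces the position arithmetic. It is plain C++, not CLucene code; the 256-byte chunk size and the "two short terms before the big one" layout are just my reading of the failing test:

#include <cstdio>
#include <string>
#include <vector>

int main() {
    const size_t maxLen = 256;   // chunk size observed in the report, assumed here
    // two short terms, the 16383-byte term, then the term the test checks
    std::vector<std::string> words = { "a", "b", std::string(16383, 'x'), "c" };

    size_t pos = 0;
    for (const std::string& w : words) {
        if (w.size() <= maxLen) {
            std::printf("'%s' at position %zu\n", w.c_str(), pos);
            ++pos;
        } else {
            // current behaviour: the word is emitted as ceil(len/maxLen)
            // chunks, each chunk taking its own position
            size_t chunks = (w.size() + maxLen - 1) / maxLen;   // 16383 -> 64
            std::printf("big word -> %zu tokens (positions %zu..%zu)\n",
                        chunks, pos, pos + chunks - 1);
            pos += chunks;
        }
    }
    // Output ends with 'c' at position 66; it would be at position 3 if the
    // big word occupied just one position.
    return 0;
}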
To fix this, StandardTokenizer would have to be modified so that it doesn't split at the LUCENE_MAX_WORD_LEN limit. That seems to me like quite a big change, one that could break subsequent processing (e.g. TokenFilters) if some piece of code assumes no token is ever longer than LUCENE_MAX_WORD_LEN. In my opinion this isn't important enough to fix immediately or even in the near future. Let's postpone it until we have a stable and tested 2_3_2.

What do you think?

Borek
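P.S. Just to make the alternative concrete, this is roughly the idea (a sketch only, written against a plain std::istream rather than the real StandardTokenizer reader, and ignoring all the unicode/apostrophe handling): when a word runs past the limit, drain the rest of the run and emit nothing for it, instead of cutting it into LUCENE_MAX_WORD_LEN-sized pieces.

#include <cctype>
#include <istream>
#include <string>

// Hypothetical helper, not CLucene API.
bool nextToken(std::istream& in, std::string& out, size_t maxLen) {
    out.clear();
    int c;
    while ((c = in.get()) != EOF) {
        if (!std::isalnum(c)) {
            if (!out.empty()) return true;   // a normal-sized word ended
            continue;                        // just a separator
        }
        out.push_back(static_cast<char>(c));
        if (out.size() > maxLen) {
            // too long: swallow the rest of the run and drop the word
            while ((c = in.get()) != EOF && std::isalnum(c)) { /* drain */ }
            out.clear();
        }
    }
    return !out.empty();
}

Even in this toy form there are open questions (e.g. whether the dropped word should still bump the position increment once), which is another reason to leave it until after a stable 2_3_2.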