2010/8/10 Kostka Bořivoj <kos...@tovek.cz>:
> The problem is that the bigTerm (a 16383-byte word) added to the doc isn't
> returned as one token during indexing. StandardTokenizer splits it into a
> set of tokens, each 256 bytes long. So the term isn't skipped as too long
> but is indexed as a set of tokens. Then, of course, the next term under
> test isn't at position 3 but at position 66, which causes the assert.
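For concreteness, the position arithmetic above can be checked with a small
standalone C++ program (illustrative only, not CLucene code; the 16383 and
256 constants come from the message): splitting the big term at a 256-byte
limit yields 64 chunks, so the following term lands 63 positions later than
expected.

    // Illustrative sketch: why the next term moves from position 3 to 66
    // when a 16383-byte word is split into 256-byte tokens.
    #include <cstdio>

    int main() {
        const int bigTermLen = 16383;  // length of the oversized word
        const int maxWordLen = 256;    // per-token limit applied by the tokenizer
        // tokens emitted for the big word instead of one (ceiling division)
        int chunks = (bigTermLen + maxWordLen - 1) / maxWordLen;  // 64
        int expectedPos = 3;                       // position if one token
        int actualPos = expectedPos + chunks - 1;  // 66: 63 extra tokens
        printf("chunks=%d expected=%d actual=%d\n", chunks, expectedPos, actualPos);
        return 0;
    }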
Is there a difference between CLucene and JLucene here? The JLucene
StandardTokenizer skips tokens that are longer than maxTokenLength.

> To fix this, StandardTokenizer would have to be modified not to split on
> the LUCENE_MAX_WORD_LEN limit. That seems to me like quite a big change,
> and it could break subsequent processing (e.g. TokenFilters) if some piece
> of code assumes there is no token longer than LUCENE_MAX_WORD_LEN.

I agree. JLucene does something similar to avoid putting tokens longer than
a particular length into the index.

Veit
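To make the two behaviours concrete, here is a toy standalone C++ sketch
(purely illustrative; the tokenize function, its skipLong flag, and the
whitespace splitting are hypothetical simplifications, not the CLucene or
JLucene API) contrasting the split policy with the skip policy:

    // Illustrative sketch of the two policies for over-long words:
    //   split: emit maxLen-byte chunks (inflates token count/positions)
    //   skip:  drop the word entirely (no oversized term reaches the index)
    #include <iostream>
    #include <sstream>
    #include <string>
    #include <vector>

    std::vector<std::string> tokenize(const std::string& text,
                                      size_t maxLen, bool skipLong) {
        std::vector<std::string> tokens;
        std::istringstream in(text);
        std::string word;
        while (in >> word) {
            if (word.size() <= maxLen) {
                tokens.push_back(word);
            } else if (!skipLong) {
                // split policy: one maxLen-sized chunk after another
                for (size_t i = 0; i < word.size(); i += maxLen)
                    tokens.push_back(word.substr(i, maxLen));
            }
            // skip policy: the over-long word is silently discarded
        }
        return tokens;
    }

    int main() {
        std::string doc = "one two " + std::string(1000, 'x') + " three";
        std::cout << "split: " << tokenize(doc, 256, false).size()
                  << " tokens\n";  // 7: "one", "two", 4 chunks, "three"
        std::cout << "skip:  " << tokenize(doc, 256, true).size()
                  << " tokens\n";  // 3: the long word is dropped
        return 0;
    }

Either way no token longer than maxLen reaches the index, but the split
policy inflates the positions of every term after the long word (by 63 in
the case above), which is what trips the assert; skipping avoids that
inflation and also guarantees downstream TokenFilters never see a token
longer than the limit.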