2010/8/10 Kostka Bořivoj <kos...@tovek.cz>:
> The problem is that the bigTerm (a 16383-byte word) added to the doc isn't
> returned as one token during indexing. StandardTokenizer splits it into a
> set of tokens, each 256 bytes long. So the term isn't skipped as too long
> but indexed as a set of tokens. Then, of course, the next tested term isn't
> at position 3 but at position 66, which triggers the assert.

Is there a difference between CLucene and JLucene? The JLucene
StandardTokenizer skips tokens that are longer than maxTokenLength.
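
A rough sketch of that idea (simplified Java, not the actual JLucene source;
the class and field names below are made up for illustration): a token longer
than maxTokenLength is dropped rather than split, but the position increment
still grows, so terms that follow a skipped bigTerm keep their expected
positions.

    import java.util.ArrayList;
    import java.util.List;

    // Illustrative sketch only, not real CLucene/JLucene code.
    final class SkipLongTokensSketch {
        static final int MAX_TOKEN_LENGTH = 255; // assumed default, like maxTokenLength

        static final class Tok {
            final String text;
            final int posIncrement;
            Tok(String text, int posIncrement) {
                this.text = text;
                this.posIncrement = posIncrement;
            }
        }

        static List<Tok> tokenize(String input) {
            List<Tok> out = new ArrayList<Tok>();
            int posIncr = 1;
            for (String word : input.split("\\s+")) {
                if (word.isEmpty()) continue;
                if (word.length() <= MAX_TOKEN_LENGTH) {
                    out.add(new Tok(word, posIncr)); // emit token at current increment
                    posIncr = 1;                     // reset for the next token
                } else {
                    posIncr++;                       // skip the too-long token, keep its position slot
                }
            }
            return out;
        }
    }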

> To fix this, StandardTokenizer has to be modified not to split at the
> LUCENE_MAX_WORD_LEN limit. This seems to me like quite a big change, which
> could break subsequent processing (e.g. TokenFilters) if some piece of code
> assumes there is no token longer than LUCENE_MAX_WORD_LEN.

I agree. JLucene does something similar to avoid getting tokens longer than
a particular length into the index.
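
Purely as an illustration, building on the sketch above (again with made-up
names, not real CLucene/JLucene API): with skip-plus-position-increment
behaviour, a term that follows the oversized word stays where the test
expects it, whereas splitting the big word into 256-byte pieces pushes it
dozens of positions later (position 66 instead of 3 in the failing test).

    // Hypothetical usage of SkipLongTokensSketch from the sketch above.
    final class SkipLongTokensDemo {
        public static void main(String[] args) {
            StringBuilder big = new StringBuilder();
            for (int i = 0; i < 16383; i++) big.append('x'); // the oversized "bigTerm"

            String doc = "first second " + big + " term";
            for (SkipLongTokensSketch.Tok t : SkipLongTokensSketch.tokenize(doc)) {
                System.out.println(t.posIncrement + " " + t.text);
            }
            // Expected output under this sketch's assumptions:
            // 1 first
            // 1 second
            // 2 term   <-- bigTerm was skipped, but its position slot is preserved
        }
    }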

Veit
