On 10/8/2010 11:02 PM, Veit Jahns wrote:
> 2010/8/10 Kostka Bořivoj <kos...@tovek.cz>:
>    
> Is there a difference between CLucene and JLucene? The JLucene
> StandardTokenizer skips tokens that are longer than maxTokenLength.
>    
There definitely is. CLucene has its own implementation of 
StandardTokenizer, and some tests I ported a while back already showed 
how different it is from JLucene's.

Eventually we will get to porting that as well - it isn't too difficult, 
as most of the code is generated by a tool and therefore isn't too 
complex, but the main question here is priorities. As Borek indicated, 
we should get the core better tested first.

Also, we should think about how to deal with backward compatibility. 
Indexes built with one type of analyzer aren't reliably searchable using 
a different one - and the two analyzers do differ...
>> To fix this, StandardTokenizer has to be modified to not split on the
>> LUCENE_MAX_WORD_LEN limit. This seems to me like quite a big change,
>> which can break subsequent processing (e.g. TokenFilters) if some piece of
>> code assumes there is no token longer than LUCENE_MAX_WORD_LEN.
>>      
> I agree. JLucene does something similar to avoid indexing tokens longer
> than a particular length.
>    
I think the easiest solution to the long-token splitting issue is to just 
port JLucene's StandardTokenizer; it isn't a lot of code and, as I said, 
should be fairly easy to do. But again - backward compatibility and priorities...

Itamar.

_______________________________________________
CLucene-developers mailing list
CLucene-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/clucene-developers
