On 10/8/2010 11:02 PM, Veit Jahns wrote:
> 2010/8/10 Kostka Bořivoj <kos...@tovek.cz>:
>> Is there a difference between CLucene and JLucene?
> The JLucene StandardTokenizer skips tokens that are longer than
> maxTokenLength.

There definitely is. CLucene has its own implementation of
StandardTokenizer, and some tests I ported a while back already showed
how different it is from JLucene's.
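To make the behavioural difference concrete, here is a minimal, hypothetical sketch in C++. It is not CLucene or JLucene code (both real tokenizers are grammar-based and far more involved); the function names `tokenizeSplit` and `tokenizeSkip` are invented for illustration. It contrasts splitting an over-length token into max-length chunks with dropping it entirely:

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Split-style behaviour: a token longer than maxLen is emitted as
// several chunks of at most maxLen characters each.
std::vector<std::string> tokenizeSplit(const std::string& text,
                                       std::size_t maxLen) {
    std::vector<std::string> tokens;
    std::size_t start = 0;
    while (start < text.size()) {
        std::size_t end = text.find(' ', start);
        if (end == std::string::npos) end = text.size();
        for (std::size_t i = start; i < end; i += maxLen)
            tokens.push_back(text.substr(i, std::min(maxLen, end - i)));
        start = end + 1;
    }
    return tokens;
}

// Skip-style behaviour (what JLucene's StandardTokenizer does with
// maxTokenLength): a token longer than maxLen is dropped entirely.
std::vector<std::string> tokenizeSkip(const std::string& text,
                                      std::size_t maxLen) {
    std::vector<std::string> tokens;
    std::size_t start = 0;
    while (start < text.size()) {
        std::size_t end = text.find(' ', start);
        if (end == std::string::npos) end = text.size();
        std::size_t len = end - start;
        if (len > 0 && len <= maxLen)
            tokens.push_back(text.substr(start, len));
        start = end + 1;
    }
    return tokens;
}
```

With input `"ab abcdefgh"` and a limit of 4, the split variant yields `{"ab", "abcd", "efgh"}` while the skip variant yields only `{"ab"}` - which is exactly why indexes built with the two behaviours aren't interchangeable: the extra chunk terms exist in one index and not the other.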
Eventually we will get to porting that as well - it isn't too difficult,
as most of the code is generated by a tool and therefore isn't too
complex, but the main question here is priorities. As Borek indicated,
we should have a better-tested core first. We should also think about
how to deal with backward compatibility: indexes built with one type of
analyzer aren't really searchable using a different one - and they do
differ...

>> To fix this, StandardTokenizer has to be modified to not split on the
>> LUCENE_MAX_WORD_LEN limit. This seems to me quite a big change, which
>> can break subsequent processing (e.g. TokenFilters) if some piece of
>> code assumes there is no token longer than LUCENE_MAX_WORD_LEN.
>
> I agree. JLucene does something similar to avoid having tokens longer
> than a particular length in the index.

I think the easiest solution to the splitting of long terms is to just
port JLucene's StandardTokenizer; it isn't a lot of code and, as I said,
should be fairly easy to do. But again - backward compatibility and
priorities...

Itamar.

_______________________________________________
CLucene-developers mailing list
CLucene-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/clucene-developers