On Thu, Apr 15, 2010 at 1:30 PM, DM Smith <dmsmith...@gmail.com> wrote:

>
> Another behavior change is an upgrade in Java version. By forcing users to
> go to Java 5 with Lucene 3, the version of Unicode changed. This in itself
> causes a change in some token streams.
>
> ...

>
> It is my observation, though possibly not correct, that core only has
> rudimentary analysis capabilities, handling English very well.
>

DM brings up some interesting points here. For example, the Porter Stemmer
in core, from 1980 or whenever, has been essentially "frozen" to all changes
for some time now; it says so on Porter's site.

This is not the case for non-English languages: things are very much in
flux, including how the characters themselves are encoded on a computer. If
we want to support languages other than English in Lucene, we have to make
it possible to iterate and improve things without making 20 copies of
something or scattering Version everywhere.


-- 
Robert Muir
rcm...@gmail.com
