On Thu, Apr 15, 2010 at 1:30 PM, DM Smith <dmsmith...@gmail.com> wrote:
> Another behavior change is an upgrade in Java version. By forcing users to
> go to Java 5 with Lucene 3, the version of Unicode changed. This in itself
> causes a change in some token streams.
>
> ...
>
> It is my observation, though possibly not correct, that core only has
> rudimentary analysis capabilities, handling English very well.

DM brings up some interesting points here. For example, the Porter stemmer in core, which dates back to 1980 or so, has been essentially "frozen" against changes for some time now; Porter's site says so. This is not the case for non-English languages: things are very much in flux, including how the characters themselves are encoded on a computer. If we want to support languages other than English in Lucene, we have to make it possible to iterate and improve things without making 20 copies of something or scattering Version everywhere.

--
Robert Muir
rcm...@gmail.com