Since questions and discussions about whether Lucene supports specific
natural languages come up from time to time, I adapted a set of
analyzers and filters to use the Java stemmers generated by Snowball
(http://snowball.tartarus.org). This could be a good starting point for
anybody who needs to go into more detail for a particular language (as
the existing Russian and German analyzers do). It uses the
StandardTokenizer, which works fine for all of these languages except
Russian.
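
For anybody who wants to see the general shape of such an analysis chain
(tokenize, lowercase, stem), here is a minimal self-contained sketch. It
is not the code in the package and it does not use the Lucene or Snowball
APIs: the Stemmer interface, TOY_STEMMER and analyze() are hypothetical
names made up for this illustration, the regular-expression split is only
a crude stand-in for StandardTokenizer, and the toy stemmer merely strips
a trailing "s" instead of running a real Snowball algorithm.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Illustrative sketch only: none of these names come from Lucene or Snowball.
final class StemmingPipelineSketch {

    /** Hypothetical stand-in for a generated, language-specific stemmer. */
    interface Stemmer {
        String stem(String word);
    }

    /**
     * Toy stemmer (an assumption for the demo, NOT Porter2): drops one
     * trailing "s" from words longer than three letters, unless the word
     * ends in "ss".
     */
    static final Stemmer TOY_STEMMER = word ->
            word.length() > 3 && word.endsWith("s") && !word.endsWith("ss")
                    ? word.substring(0, word.length() - 1)
                    : word;

    /** Tokenizes the text, lowercases each token, then stems it. */
    static List<String> analyze(String text, Stemmer stemmer) {
        List<String> terms = new ArrayList<>();
        // Split on runs of characters that are neither letters nor digits.
        // This is a crude substitute for a real tokenizer.
        for (String token : text.split("[^\\p{L}\\p{N}]+")) {
            if (token.isEmpty()) {
                continue; // a leading delimiter produces an empty first token
            }
            terms.add(stemmer.stem(token.toLowerCase(Locale.ROOT)));
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(analyze("Lucene analyzers and filters", TOY_STEMMER));
    }
}
```

Supporting another language in a chain like this would amount to passing
a different Stemmer implementation to analyze(), which is roughly the
role the per-language generated stemmers play.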

The whole package is located at http://download.lissus.com/snowball.zip
and is about 2.3MB. It is that large because it also contains all the
test dictionaries for the 12 supported languages: Danish, Dutch,
English (Porter2), Finnish, French, German, Italian, Norwegian,
Portuguese, Russian, Spanish and Swedish. Finnish has some minor
problems, and I wasn't able to test Russian properly since I am not
familiar with the character encodings involved. But I wouldn't bother
with Russian (or German), since analyzers for both are already included
in the Lucene package. As for Finnish, I am already in contact with the
Snowball team, and hopefully it will work in Java as well as it does in
the other environments.

Best regards,

Alex

