Since questions about whether Lucene supports specific natural languages come up from time to time, I have adapted a set of analyzers and filters to use the Java stemmers generated by Snowball (http://snowball.tartarus.org). This could be a good starting point for anybody who needs to go into more detail for a particular language (as the existing Russian and German analyzers do). The analyzers use the StandardTokenizer, which works fine for the supported languages other than Russian.
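To illustrate the general shape of such an analyzer (tokenize, lowercase, then stem each token), here is a minimal sketch in plain Java. It is not the code from the package and does not use the Lucene or Snowball APIs: the class name `StemmingPipelineSketch`, the `analyze` method and the toy suffix-stripping stemmer are all hypothetical stand-ins, and the regular-expression split is only a crude substitute for the StandardTokenizer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.function.UnaryOperator;

public class StemmingPipelineSketch {

    // Hypothetical stand-in for a Snowball-generated stemmer: maps one
    // lowercased token to its stem. A real Snowball stemmer applies far
    // richer, language-specific rules than these two suffix checks.
    public static final UnaryOperator<String> TOY_ENGLISH_STEMMER = token -> {
        if (token.length() > 5 && token.endsWith("ing")) {
            return token.substring(0, token.length() - 3);
        }
        if (token.length() > 3 && token.endsWith("s") && !token.endsWith("ss")) {
            return token.substring(0, token.length() - 1);
        }
        return token;
    };

    // Tokenize, lowercase and stem. Swapping the stemmer argument is the
    // only change needed to target another language in this sketch.
    public static List<String> analyze(String text, UnaryOperator<String> stemmer) {
        List<String> stems = new ArrayList<>();
        // Split on any run of characters that are neither letters nor digits.
        for (String raw : text.split("[^\\p{L}\\p{N}]+")) {
            if (raw.isEmpty()) {
                continue; // a leading delimiter produces an empty first element
            }
            stems.add(stemmer.apply(raw.toLowerCase(Locale.ROOT)));
        }
        return stems;
    }

    public static void main(String[] args) {
        System.out.println(analyze("Indexing the stemmers", TOY_ENGLISH_STEMMER));
    }
}
```

The point of passing the stemmer in as a parameter is that the tokenizing and lowercasing steps stay the same across languages, and only the stemming step is language-specific.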
The whole package is available at http://download.lissus.com/snowball.zip and is about 2.3 MB. It is that large because it also contains all the test dictionaries for the 12 supported languages: Danish, Dutch, English (Porter2), Finnish, French, German, Italian, Norwegian, Portuguese, Russian, Spanish and Swedish. Finnish has some minor problems, and I was not able to test Russian properly since I am not familiar with character encodings. But I wouldn't bother with Russian (or German), since analyzers for both are already included in the Lucene package. As for Finnish, I am already in contact with the Snowball team, and hopefully it will work in Java as well as it does in the other environments.

Best regards,
Alex
http://www.lissus.com
