The Snowball stemmer is not of very good quality. I think the best approach would be to build a lemmatizer from ispell, or more precisely from the ispell rules syntax. As for the language identifier, the best overall one is based on Ted Dunning's work. You can find the source code on the web.
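
To give a rough idea of what a statistical language identifier looks like, here is a minimal Java sketch. It is not the code referred to above. It assumes a simplified scheme: character trigram counts per language, scored with a crude add-one smoothed log-likelihood. The class name `NgramLanguageGuesser`, its methods, and the smoothing choice are all hypothetical and chosen only for illustration.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of a character-trigram language guesser (illustration only). */
public class NgramLanguageGuesser {
    private static final int N = 3; // n-gram length, an assumption for this sketch

    // language -> (trigram -> number of times seen in training text)
    private final Map<String, Map<String, Integer>> counts = new HashMap<>();
    // language -> total number of trigrams seen in training text
    private final Map<String, Integer> totals = new HashMap<>();

    /** Adds the trigrams of the given text to the profile of the given language. */
    public void train(String language, String text) {
        Map<String, Integer> profile = counts.computeIfAbsent(language, k -> new HashMap<>());
        String s = normalize(text);
        int total = totals.getOrDefault(language, 0);
        for (int i = 0; i + N <= s.length(); i++) {
            profile.merge(s.substring(i, i + N), 1, Integer::sum);
            total++;
        }
        totals.put(language, total);
    }

    /** Returns the trained language with the highest score, or null if nothing was trained. */
    public String guess(String text) {
        String s = normalize(text);
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Map<String, Integer>> e : counts.entrySet()) {
            Map<String, Integer> profile = e.getValue();
            // Crude add-one smoothing: seen trigram types plus one bucket for unseen ones.
            double denom = totals.get(e.getKey()) + profile.size() + 1.0;
            double score = 0.0;
            for (int i = 0; i + N <= s.length(); i++) {
                int c = profile.getOrDefault(s.substring(i, i + N), 0);
                score += Math.log((c + 1.0) / denom);
            }
            if (score > bestScore) {
                bestScore = score;
                best = e.getKey();
            }
        }
        return best;
    }

    /** Lowercases, collapses whitespace, and pads with spaces so word edges form trigrams. */
    private static String normalize(String text) {
        return " " + text.toLowerCase().replaceAll("\\s+", " ").trim() + " ";
    }
}
```

In practice each language would be trained on a much larger sample; with a sentence or two per language the guesses are only reliable for text that resembles the training data.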
The Dunning-based identifier is C code, but it can easily be ported to Java. Also of interest is the Mozilla source code, which includes code that does encoding detection. In fact, I developed a Java library starting from that source code. It is released under the LGPL license; would you be interested in merging that source code into Lucene?

-Neil

-----Original Message-----
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]]
Sent: January 7, 2003 12:06
To: [EMAIL PROTECTED]
Subject: language identifier contrib

Now that Doug put Snowball's stemmers in the Lucene Sandbox, it would be nice to have that language recognition contribution that somebody mentioned a month or two ago.

Ah, here it is, the original email that mentions this language identifier:
http://nagoya.apache.org/eyebrowse/ReadMsg?[EMAIL PROTECTED]&msgNo=2695

There's also this:
http://frank.spieleck.de/ngram/

Thanks,
Otis

--
To unsubscribe, e-mail: <mailto:[EMAIL PROTECTED]>
For additional commands, e-mail: <mailto:[EMAIL PROTECTED]>