The Snowball stemmer is not of very good quality. I think the best approach would be
to build a lemmatizer from ispell, or more precisely from the ispell rules syntax.

As for the language identifier, the best overall one is based on Ted Dunning's work.
You can find the source code on the web. It's C code, but it can easily be ported to
Java.

Also of interest is the Mozilla source code, which includes code that does encoding
detection. In fact, I developed a Java lib starting from that source code. It's under
the LGPL license. Would you be interested in merging that source code into Lucene?
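
To give a rough idea of what a statistical language identifier of this kind does,
here is a minimal Java sketch. It is not Ted Dunning's code and not the lib I
mentioned: the class name NgramLanguageGuesser, the trigram size, and the add-one
smoothing are all assumptions chosen for illustration. It counts character trigrams
per language and picks the language whose counts give the input the highest
log-probability.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Hypothetical character-trigram language guesser (illustrative sketch only). */
public class NgramLanguageGuesser {
    private static final int N = 3; // n-gram length, an assumption

    // language -> (trigram -> count seen in training text)
    private final Map<String, Map<String, Integer>> counts = new HashMap<>();
    // language -> total number of trigrams seen in training text
    private final Map<String, Integer> totals = new HashMap<>();
    // every distinct trigram seen in any language, used for smoothing
    private final Set<String> vocabulary = new HashSet<>();

    /** Adds the trigrams of the given text to the profile of the given language. */
    public void train(String language, String text) {
        Map<String, Integer> profile =
                counts.computeIfAbsent(language, k -> new HashMap<>());
        String s = normalize(text);
        int total = totals.getOrDefault(language, 0);
        for (int i = 0; i + N <= s.length(); i++) {
            String gram = s.substring(i, i + N);
            profile.merge(gram, 1, Integer::sum);
            vocabulary.add(gram);
            total++;
        }
        totals.put(language, total);
    }

    /** Returns the trained language with the highest score, or null if none is trained. */
    public String guess(String text) {
        String s = normalize(text);
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        double v = vocabulary.size() + 1; // +1 leaves room for unseen trigrams
        for (Map.Entry<String, Map<String, Integer>> e : counts.entrySet()) {
            Map<String, Integer> profile = e.getValue();
            double total = totals.get(e.getKey());
            double score = 0.0;
            for (int i = 0; i + N <= s.length(); i++) {
                int c = profile.getOrDefault(s.substring(i, i + N), 0);
                // add-one smoothed log-probability of this trigram
                score += Math.log((c + 1.0) / (total + v));
            }
            if (score > bestScore) {
                bestScore = score;
                best = e.getKey();
            }
        }
        return best;
    }

    // Lower-cases, collapses whitespace, and pads with spaces to mark word edges.
    private static String normalize(String text) {
        return " " + text.toLowerCase(Locale.ROOT).replaceAll("\\s+", " ").trim() + " ";
    }
}
```

In practice each language would be trained on a much larger sample than a sentence
or two, and very short inputs will remain unreliable.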


-Neil



-----Original Message-----
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]]
Sent: January 7, 2003 12:06
To: [EMAIL PROTECTED]
Subject: language identifier contrib


Now that Doug has put the Snowball stemmers in the Lucene Sandbox, it would be
nice to have that language recognition contribution that somebody
mentioned a month or two ago.

Ah, here it is, the original email that mentions this language
identifier:
http://nagoya.apache.org/eyebrowse/ReadMsg?[EMAIL PROTECTED]&msgNo=2695

There's also this:
http://frank.spieleck.de/ngram/

Thanks,
Otis



--
To unsubscribe, e-mail:   <mailto:[EMAIL PROTECTED]>
For additional commands, e-mail: <mailto:[EMAIL PROTECTED]>

