Daniel Naber wrote:
The thesaurus would benefit from code that can find the base form of any word, e.g. walked -> walk, children -> child. This could be plugged into the existing thesaurus code easily; it's basically just one method like "getBaseform(String)". Of course it would need to support several languages. Some languages are very irregular, and this also needs to be handled efficiently.
This is already done, Daniel. Look at hunmorph in the hunspell package: it is a stemmer that uses myspell/hunspell-formatted dictionaries.
The task is to integrate the thesaurus code with appropriate calls to the functions hunmorph uses.
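To make the "getBaseform(String)" idea concrete, here is a minimal sketch in Java. It is not hunmorph's algorithm or API (hunmorph works from the affix rules in the dictionary files). The class name, the tiny word list, the exception table and the suffix list are all hypothetical and exist only for illustration. It checks an exception table for irregular forms first, then tries stripping a few suffixes and accepts a candidate only if it is a known word.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Illustrative only: a toy base-form lookup, not the hunmorph implementation.
final class BaseFormFinder {
    // Hypothetical exception table for irregular forms (one hash lookup per word).
    private static final Map<String, String> IRREGULAR = new HashMap<>();
    // Hypothetical word list; a real integration would consult the dictionary.
    private static final Set<String> KNOWN_WORDS =
            new HashSet<>(Arrays.asList("walk", "child", "house"));
    // Hypothetical English-only suffix list, tried in this order.
    private static final String[] SUFFIXES = {"ed", "ing", "s"};

    static {
        IRREGULAR.put("children", "child");
    }

    // Returns the base form if one is found, otherwise the lower-cased input.
    static String getBaseform(String word) {
        String w = word.toLowerCase(Locale.ROOT);
        String irregular = IRREGULAR.get(w);
        if (irregular != null) {
            return irregular;
        }
        for (String suffix : SUFFIXES) {
            if (w.endsWith(suffix)) {
                String candidate = w.substring(0, w.length() - suffix.length());
                // Accept a stripped form only if it is a known word.
                if (KNOWN_WORDS.contains(candidate)) {
                    return candidate;
                }
            }
        }
        return w;
    }
}
```

With these toy tables, getBaseform("walked") gives "walk" and getBaseform("children") gives "child". Supporting several languages would mean loading the tables per language instead of hard-coding them.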
BTW, an easier thing to start with would be to check how the thesaurus code can be modified so it supports UTF-8. A standalone version of the thesaurus code is available at http://lingucomponent.openoffice.org/thesaurus.html
Yeah, that should be done - especially because hunspell supports multibyte characters.
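A small Java sketch of why this matters, using only the standard library. The class and method names are made up for this example, and it says nothing about how the thesaurus code itself is written. It shows that one character can take more than one byte in UTF-8, so code that treats bytes as characters, or decodes with a single-byte charset, will mangle such words.

```java
import java.nio.charset.StandardCharsets;

// Illustrative only: byte length vs. character length for UTF-8 text.
final class Utf8Demo {
    // Number of bytes the string occupies when encoded as UTF-8.
    static int utf8ByteLength(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Correct decoding: interpret the bytes as UTF-8.
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Wrong decoding for UTF-8 input: every byte becomes one character.
    static String decodeAsLatin1(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }
}
```

For a four-letter Polish word written as "\u017c\u00f3\u0142w", length() is 4 but utf8ByteLength is 7, and decodeAsLatin1 on its UTF-8 bytes gives a 7-character garbled string.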
Regards, Marcin
