Hi, I'm not sure these non-English spellcheckers, analyzers and related resources are good idea in real usage. English grammar is quite simple and can be captured in Porter's rules, but others so different. For example, Porter's rules can not work well in French grammar, neither in Asian languages. Languages libraries providing in Lucene are practical, but they are satisfy in real usage? Need an evaluation effort in each language.
Even you have a special language, it's still difficult to decide "satisfy level". You may need simply Lucene analyser in general search, but in sophisticate content with high quality required result, you may need a morpho-syntax or semantic analyzer in this context. And these analyzers are so expensive (in cost, and in processing time). Develop a language spellchecker, analyzer, stemmer, ... and its dictionaries is still difficult and out of developer's scope. And Search Provider keeps these advanced features in private. For each language, you can find out a library (mostly in research context) for spellchecker, analyzer. You have to integrate in Lucene. Apache Open-NLP project is a nice effort to collect languages NLP works around the world, then integrate in common platform: http://incubator.apache.org/opennlp/ Princeton Wordnet is concept's definition, not dictionary. These is some other languages: http://www.globalwordnet.org/gwa/wordnet_table.htm I'm wondering how Wordnet is useful in search context. You can may uses synsets (synonyms) like a suggestion dictionary. But stopwords, stem and analyzer dictionaries are dependant to associate modules. Best, ------------------- Hong-Thai -----Message d'origine----- De : Pulkit Singhal [mailto:pulkitsing...@gmail.com] Envoyé : jeudi 6 janvier 2011 17:54 À : java-user@lucene.apache.org Objet : Where to find non-English dictionaries, thesaurus, synonyms Hello, What's a good source to get dictionaries (for spellcorrections) and/or thesaurus (for synonyms) that can be used with Lucene for non-English languages such as Fresh, Chinese, Korean etc? For example, the wordnet contrib module is based on the data set provided by the Princeton based wordnet system but I'm wondering where the Lucene users go for similar reliable source for other languages? Thanks! --------------------------------------------------------------------- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org