Hi,

I'm not sure these non-English spellcheckers, analyzers, and related resources
work well in real usage. English morphology is quite simple and can largely be
captured by Porter's rules, but other languages are very different. For
example, Porter-style rules do not work well for French, nor for Asian
languages. The language libraries provided in Lucene are practical, but are
they satisfactory in real usage? An evaluation effort is needed for each
language.

Even for a specific language, it's still difficult to decide on a
"satisfactory level". A standard Lucene analyzer may be enough for general
search, but for sophisticated content where high-quality results are required,
you may need a morpho-syntactic or semantic analyzer. And such analyzers are
expensive (both in cost and in processing time).

Developing a spellchecker, analyzer, stemmer, etc. for a language, along with
its dictionaries, is still difficult and usually out of a developer's scope.
And search providers keep these advanced features private.

For each language you can usually find a library (mostly from a research
context) for spellchecking or analysis, but you have to integrate it into
Lucene yourself.
The Apache OpenNLP project is a nice effort to collect language NLP work from
around the world and integrate it into a common platform:
http://incubator.apache.org/opennlp/

Princeton WordNet is a database of concept definitions, not a dictionary.
There are versions for some other languages:
http://www.globalwordnet.org/gwa/wordnet_table.htm

I'm wondering how useful WordNet is in a search context. You could use its
synsets (synonym sets) as a suggestion dictionary, but stopword, stemming, and
analyzer dictionaries are tied to their respective modules.
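As a minimal sketch of the synset-as-suggestion-dictionary idea: expand each
query term with its synonyms before searching. The synset map here is a
hard-coded stand-in for data you would load from a WordNet export; the class
and method names are my own, not a Lucene API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class SynonymExpansion {
    // Hypothetical synsets, standing in for data loaded from WordNet.
    static final Map<String, List<String>> SYNSETS = Map.of(
        "car", List.of("auto", "automobile", "motorcar"),
        "quick", List.of("fast", "rapid", "speedy")
    );

    // Append each term's synonyms (if any) after the term itself.
    static List<String> expand(List<String> queryTerms) {
        List<String> expanded = new ArrayList<>();
        for (String term : queryTerms) {
            expanded.add(term);
            expanded.addAll(SYNSETS.getOrDefault(term, List.of()));
        }
        return expanded;
    }

    public static void main(String[] args) {
        System.out.println(expand(List.of("quick", "car")));
        // [quick, fast, rapid, speedy, car, auto, automobile, motorcar]
    }
}
```

The expanded term list could then feed a Lucene BooleanQuery, or back a
SynonymFilter-style token stream at index time.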

Best,

-------------------
Hong-Thai
-----Original Message-----
From: Pulkit Singhal [mailto:pulkitsing...@gmail.com]
Sent: Thursday, January 6, 2011 17:54
To: java-user@lucene.apache.org
Subject: Where to find non-English dictionaries, thesaurus, synonyms

Hello,

What's a good source to get dictionaries (for spelling corrections) and/or
thesauruses (for synonyms) that can be used with Lucene for non-English
languages such as French, Chinese, Korean, etc.?

For example, the wordnet contrib module is based on the data set
provided by the Princeton WordNet system, but I'm wondering where
Lucene users go for similarly reliable sources for other languages?

Thanks!

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org