Hi, you could try Doug Beeferman's variable-length character n-gram approach to identify a language among 13 European ones: http://www.dougb.com/ident.html
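For concreteness, here is a minimal C++ sketch of the general idea behind that kind of identifier (a character n-gram frequency profile per language, compared against the input text's profile with cosine similarity, as mentioned below). It is not Beeferman's implementation; the toy training strings and names such as ngramProfile are made up for illustration.

// Sketch only: character trigram profiles compared with cosine similarity.
// Train one profile per language from sample text, then pick the language
// whose profile is closest to the profile of the input text.
#include <cmath>
#include <cstddef>
#include <iostream>
#include <map>
#include <string>

using Profile = std::map<std::string, double>;

// Count character n-grams (default: trigrams) in a text.
Profile ngramProfile(const std::string& text, std::size_t n = 3) {
    Profile p;
    if (text.size() < n) return p;
    for (std::size_t i = 0; i + n <= text.size(); ++i)
        p[text.substr(i, n)] += 1.0;
    return p;
}

// Cosine similarity between two sparse frequency vectors.
double cosine(const Profile& a, const Profile& b) {
    double dot = 0.0, na = 0.0, nb = 0.0;
    for (const auto& [gram, w] : a) {
        na += w * w;
        auto it = b.find(gram);
        if (it != b.end()) dot += w * it->second;
    }
    for (const auto& [gram, w] : b) nb += w * w;
    if (na == 0.0 || nb == 0.0) return 0.0;
    return dot / (std::sqrt(na) * std::sqrt(nb));
}

int main() {
    // Toy corpus models; real ones would be built from much larger samples.
    std::map<std::string, Profile> models = {
        {"en", ngramProfile("the quick brown fox jumps over the lazy dog and the cat")},
        {"fr", ngramProfile("le renard brun saute par dessus le chien paresseux et le chat")},
    };

    const std::string input = "the dog and the fox";
    Profile p = ngramProfile(input);

    std::string best;
    double bestScore = -1.0;
    for (const auto& [lang, model] : models) {
        double s = cosine(p, model);
        if (s > bestScore) { bestScore = s; best = lang; }
    }
    std::cout << "guessed language: " << best << " (score " << bestScore << ")\n";
}

In practice you would train the per-language profiles from a few kilobytes of text each and perhaps keep only the most frequent n-grams per language to keep the profiles small.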
I tried Beeferman's tool and it works pretty well. It's based on a similarity measure (cosine) between a corpus model and the input text. There are issues depending on which character set you use (ISO Latin-1 or another ASCII flavor).

If you just have 4 or 5 languages to deal with, you can build your own detector from lists of the most frequent words in each language. I have some trivial C++ code that does it and can send it to you if you need it. The identified language is chosen on a frequency criterion (a sketch of this idea appears at the end of this message). Of course, commercial products are available for that (try Xerox and Inxight, for instance $$). The question is how many languages you have to identify...

Complement: some time ago, Bright Station (UK) had some open source C/C++ code for a variety of stemmers for European languages (adapted from the Porter stemmer approach).

I hope this helps,
Elie Naulleau
Semio-Sys

-----Original message-----
From: Strittmatter Stephan (external) [mailto:[EMAIL PROTECTED]]
Sent: Friday, 23 November 2001 14:56
To: 'Lucene Users List'
Subject: Automatically determine language of document

Hi,

has anyone done anything to autodetect the language of an HTML document which will be indexed by Lucene? I will use Lucene to index a multilingual portal and want to filter the hits by language.

Thanks for any ideas,
Stephan

--
To unsubscribe, e-mail: <mailto:[EMAIL PROTECTED]>
For additional commands, e-mail: <mailto:[EMAIL PROTECTED]>
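PS: the sketch of the frequent-word-list approach referred to above. This is not the trivial C++ code mentioned in the message; the tiny word lists are toy examples, and real lists would hold the few hundred most frequent words per language. The frequency criterion here is simply "most matched frequent words wins".

// Sketch only: score each language by how many of the input text's tokens
// appear in that language's list of most frequent words.
#include <algorithm>
#include <cctype>
#include <iostream>
#include <map>
#include <set>
#include <sstream>
#include <string>

int main() {
    // Illustrative word lists; real ones would be far larger.
    std::map<std::string, std::set<std::string>> frequentWords = {
        {"en", {"the", "and", "of", "to", "in", "is"}},
        {"fr", {"le", "la", "et", "de", "un", "est"}},
        {"de", {"der", "die", "und", "das", "ist", "ein"}},
    };

    const std::string input = "Der Hund und die Katze sind im Garten";

    // Lowercase, then tokenize on whitespace.
    std::string lowered = input;
    std::transform(lowered.begin(), lowered.end(), lowered.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    std::istringstream tokens(lowered);

    std::map<std::string, int> hits;
    std::string word;
    while (tokens >> word)
        for (const auto& [lang, words] : frequentWords)
            if (words.count(word)) ++hits[lang];

    // Frequency criterion: the language matching the most frequent words wins.
    std::string best = "unknown";
    int bestCount = 0;
    for (const auto& [lang, count] : hits)
        if (count > bestCount) { bestCount = count; best = lang; }

    std::cout << "guessed language: " << best << " (" << bestCount << " hits)\n";
}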
