Hi Stephan,

You'll find the example code attached to this message. The archive also contains a model.dat file for French and English. Remember that this is a simplistic approach to language guessing. It will work to distinguish between French, English, Spanish, etc., but is likely to fail between Finnish, Swedish, Norwegian, etc.

Porting to Java should be straightforward.
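For anyone reading without the attachment, here is a minimal sketch of the frequent-word idea. It is written in Python, not the C++ from the archive, and it is not the attached code: the word lists, the `guess_language` name and the scoring rule are all illustrative assumptions, and it does not read the model.dat format.

```python
import re
from collections import Counter

# Hypothetical, hand-picked lists of very common words per language.
# A real model would be derived from corpus frequency counts.
COMMON_WORDS = {
    "en": {"the", "and", "of", "to", "is", "in", "that", "it", "for", "with"},
    "fr": {"le", "la", "les", "et", "de", "des", "est", "un", "une", "que"},
    "es": {"el", "la", "los", "y", "de", "que", "es", "en", "un", "una"},
}


def guess_language(text, models=COMMON_WORDS):
    """Return the language whose common words match the most tokens, or None."""
    # Keep runs of letters only (accented letters included), lowercased.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    counts = Counter(tokens)
    # Score each language by how many tokens appear in its common-word list.
    scores = {
        lang: sum(counts[word] for word in words)
        for lang, words in models.items()
    }
    best = max(scores, key=scores.get)
    # No hit at all means there is no evidence for any language.
    return best if scores[best] > 0 else None


print(guess_language("Le chat est dans le jardin et il est content"))  # fr
print(guess_language("The cat is in the garden and it is happy"))      # en
```

Short words such as "de" or "la" are shared by several languages, so this kind of scoring only separates languages whose common words differ enough, and on a tie the first language in the table wins.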
Elie

-----Original Message-----
From: Strittmatter Stephan (external) [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, 28 November 2001 08:40
To: 'Elie Naulleau'; 'Lucene Users List'
Subject: RE: Automatically determin Language of document

Hi Elie,

> You could try Doug Beeferman's variable-length character n-gram approach
> to identify a language among 13 European ones.
> http://www.dougb.com/ident.html
> If you just have 4 or 5 languages to deal with, you can build your
> own with the most frequent word lists for each language. I have some
> trivial C++ code that does it and can send it to you if you need.
> Identified language is chosen on a frequency criterion.
>

At the moment I have only two languages (en, de), but this could increase. I don't think it will be more than your 4 to 5, though.
It would be great if you could send me your example code. I will probably try to port it to Java.

Thanks in advance,
Stephan Strittmatter

--
To unsubscribe, e-mail: <mailto:[EMAIL PROTECTED]>
For additional commands, e-mail: <mailto:[EMAIL PROTECTED]>
simplerecolang.tgz
Description: application/compressed
