RE: Automatically determine Language of document

Elie Naulleau Wed, 28 Nov 2001 01:21:13 -0800

Hi Stephan,

You'll find the example code attached to this message. The archive
also contains a model.dat file for French and English.
Remember that this is a simplistic approach to language guessing.
It will work to distinguish between French, English, Spanish, etc.,
but is likely to fail between Finnish, Swedish, Norwegian, etc.
Porting to Java should be straightforward.
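
For readers who do not have the attachment, the frequent-word idea can be
sketched roughly as follows in Python. This is not the attached C++ code and
does not read model.dat; the word lists, the FREQUENT_WORDS table and the
guess_language function are assumptions made up for illustration only.

```python
import re
from collections import Counter

# Tiny hand-picked lists of very frequent words per language.
# Illustrative assumption: these are NOT the contents of model.dat.
FREQUENT_WORDS = {
    "en": {"the", "and", "of", "to", "in", "is", "that", "it", "for", "with"},
    "fr": {"le", "la", "les", "et", "de", "des", "un", "une", "est", "que"},
    "de": {"der", "die", "das", "und", "ist", "nicht", "ein", "eine", "mit", "zu"},
}


def guess_language(text, models=FREQUENT_WORDS):
    """Return the language whose frequent words cover the largest share
    of the tokens in `text`, or None if no listed word occurs at all."""
    # Split into lowercase runs of letters (digits and punctuation are dropped).
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    if not tokens:
        return None
    counts = Counter(tokens)
    # Score = fraction of tokens that belong to the language's word list.
    scores = {
        lang: sum(counts[word] for word in words) / len(tokens)
        for lang, words in models.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


if __name__ == "__main__":
    print(guess_language("The cat is in the garden and it is happy"))
    print(guess_language("Le chat est dans le jardin et la maison est grande"))
```

Since the score rests on a handful of function words, languages that share
many short words are hard to separate this way, which is the limitation
mentioned above. Longer word lists per language should help somewhat.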


Elie

-----Original Message-----
From: Strittmatter Stephan (external)
[mailto:[EMAIL PROTECTED]]
Sent: Wednesday, 28 November 2001 08:40
To: 'Elie Naulleau'; 'Lucene Users List'
Subject: RE: Automatically determine Language of document


Hi Elie,

> You could try Doug Beeferman's variable-length character n-gram approach
> to identify a language among 13 European ones.
> http://www.dougb.com/ident.html

> If you just have 4 or 5 languages to deal with, you can build your
> own with lists of the most frequent words for each language. I have some
> trivial C++ code that does it and can send it to you if you need it.
> The identified language is chosen by a frequency criterion.
>

At the moment I have only two languages (en, de), but this could increase.
I don't think it will be more than your 4 to 5, though.
It would be great if you could send me your example code.
I will probably try to port it to Java.

Thanks in advance,

Stephan Strittmatter

--
To unsubscribe, e-mail:
<mailto:[EMAIL PROTECTED]>
For additional commands, e-mail:
<mailto:[EMAIL PROTECTED]>

simplerecolang.tgz
Description: application/compressed
