Hi, you could try Doug Beeferman's variable-length character n-gram approach to identify a language among 13 European ones: http://www.dougb.com/ident.html
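For concreteness, here is a minimal C++ sketch of the general idea behind that kind of identifier (a character n-gram frequency profile per language, compared against the input text's profile with cosine similarity, as mentioned below). It is not Beeferman's implementation; the toy training strings and names such as ngramProfile are made up for illustration.

// Sketch only: character trigram profiles compared with cosine similarity.
// Train one profile per language from sample text, then pick the language
// whose profile is closest to the profile of the input text.
#include <cmath>
#include <cstddef>
#include <iostream>
#include <map>
#include <string>

using Profile = std::map<std::string, double>;

// Count character n-grams (default: trigrams) in a text.
Profile ngramProfile(const std::string& text, std::size_t n = 3) {
    Profile p;
    if (text.size() < n) return p;
    for (std::size_t i = 0; i + n <= text.size(); ++i)
        p[text.substr(i, n)] += 1.0;
    return p;
}

// Cosine similarity between two sparse frequency vectors.
double cosine(const Profile& a, const Profile& b) {
    double dot = 0.0, na = 0.0, nb = 0.0;
    for (const auto& [gram, w] : a) {
        na += w * w;
        auto it = b.find(gram);
        if (it != b.end()) dot += w * it->second;
    }
    for (const auto& [gram, w] : b) nb += w * w;
    if (na == 0.0 || nb == 0.0) return 0.0;
    return dot / (std::sqrt(na) * std::sqrt(nb));
}

int main() {
    // Toy corpus models; real ones would be built from much larger samples.
    std::map<std::string, Profile> models = {
        {"en", ngramProfile("the quick brown fox jumps over the lazy dog and the cat")},
        {"fr", ngramProfile("le renard brun saute par dessus le chien paresseux et le chat")},
    };

    const std::string input = "the dog and the fox";
    Profile p = ngramProfile(input);

    std::string best;
    double bestScore = -1.0;
    for (const auto& [lang, model] : models) {
        double s = cosine(p, model);
        if (s > bestScore) { bestScore = s; best = lang; }
    }
    std::cout << "guessed language: " << best << " (score " << bestScore << ")\n";
}

In practice you would train the per-language profiles from a few kilobytes of text each and perhaps keep only the most frequent n-grams per language to keep the profiles small.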
I tried Beeferman's tool and it works pretty well. It's based on a similarity measure (cosine) between a corpus model and the input text. There are issues depending on which character set you use (ISO Latin-1 or another ASCII flavor).

If you just have 4 or 5 languages to deal with, you can build your own detector from lists of the most frequent words in each language. I have some trivial C++ code that does it and can send it to you if you need it. The identified language is chosen on a frequency criterion (a sketch of this idea appears at the end of this message). Of course, commercial products are available for that (try Xerox and Inxight, for instance $$). The question is how many languages you have to identify...

Complement: some time ago, Bright Station (UK) had some open source C/C++ code for a variety of stemmers for European languages (adapted from the Porter stemmer approach).

I hope this helps,
Elie Naulleau
Semio-Sys

-----Original message-----
From: Strittmatter Stephan (external) [mailto:[EMAIL PROTECTED]]
Sent: Friday, 23 November 2001 14:56
To: 'Lucene Users List'
Subject: Automatically determine language of document

Hi,

has anyone done anything to autodetect the language of an HTML document which will be indexed by Lucene? I will use Lucene to index a multilingual portal and want to filter the hits by language.

Thanks for any ideas,
Stephan

--
To unsubscribe, e-mail: <mailto:[EMAIL PROTECTED]>
For additional commands, e-mail: <mailto:[EMAIL PROTECTED]>
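PS: the sketch of the frequent-word-list approach referred to above. This is not the trivial C++ code mentioned in the message; the tiny word lists are toy examples, and real lists would hold the few hundred most frequent words per language. The frequency criterion here is simply "most matched frequent words wins".

// Sketch only: score each language by how many of the input text's tokens
// appear in that language's list of most frequent words.
#include <algorithm>
#include <cctype>
#include <iostream>
#include <map>
#include <set>
#include <sstream>
#include <string>

int main() {
    // Illustrative word lists; real ones would be far larger.
    std::map<std::string, std::set<std::string>> frequentWords = {
        {"en", {"the", "and", "of", "to", "in", "is"}},
        {"fr", {"le", "la", "et", "de", "un", "est"}},
        {"de", {"der", "die", "und", "das", "ist", "ein"}},
    };

    const std::string input = "Der Hund und die Katze sind im Garten";

    // Lowercase, then tokenize on whitespace.
    std::string lowered = input;
    std::transform(lowered.begin(), lowered.end(), lowered.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    std::istringstream tokens(lowered);

    std::map<std::string, int> hits;
    std::string word;
    while (tokens >> word)
        for (const auto& [lang, words] : frequentWords)
            if (words.count(word)) ++hits[lang];

    // Frequency criterion: the language matching the most frequent words wins.
    std::string best = "unknown";
    int bestCount = 0;
    for (const auto& [lang, count] : hits)
        if (count > bestCount) { bestCount = count; best = lang; }

    std::cout << "guessed language: " << best << " (" << bestCount << " hits)\n";
}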
