karl wettin wrote:
On Mon, 2 Feb 2004 20:10:57 +0100
"Jean-Francois Halleux" <[EMAIL PROTECTED]> wrote:


During the past few days, I've developed such a language guesser myself
as a basis for a Lucene analyzer. It is based on trigrams. It is
already working but not yet in a "publishable" state. If you or others
are interested I can offer the sources.


I use a variable gram size because it is hard to detect the language of
very small texts such as a query. For instance, with bi->quadgrams the
Swedish sentence "Jag heter Karl" (my name is Karl) is presumed to
be Afrikaans. Using uni->quadgrams does the trick.
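
To illustrate what a variable gram size means here, a minimal Python sketch follows. It is not the code offered in this thread: the function name `char_ngrams`, the per-word splitting, and the `_` word-boundary padding are all assumptions made for the example.

```python
def char_ngrams(text, min_n=1, max_n=4):
    """Return every character n-gram of length min_n..max_n, word by word."""
    grams = []
    for word in text.lower().split():
        # Assumption: pad each word with "_" so grams at word edges are distinct.
        padded = f"_{word}_"
        for n in range(min_n, max_n + 1):
            for i in range(len(padded) - n + 1):
                grams.append(padded[i:i + n])
    return grams


# bi->quadgram versus uni->quadgram on the sentence from the message
bi_to_quad = char_ngrams("Jag heter Karl", min_n=2, max_n=4)
uni_to_quad = char_ngrams("Jag heter Karl", min_n=1, max_n=4)
```

The only difference between the two calls is the lower bound: with min_n=1 the single characters are included as well, which gives a three-word query noticeably more evidence to match against.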

Also, I add penalties for gram-sized words found in the text but not in
the classified language. This improved my results even more.
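
One possible reading of that penalty, sketched in Python. The message does not say how the penalty is computed, so everything here is an assumption: the name `short_word_penalty`, the fixed penalty of 1.0 per word, and the toy gram sets, which are hand-picked for the example and are not real language profiles.

```python
def short_word_penalty(text, language_grams, max_n=4, penalty=1.0):
    """Add a fixed penalty for each gram-sized word missing from a language.

    A word is "gram-sized" if it is no longer than the largest gram (max_n),
    so the whole word could itself appear as a gram in the language data.
    """
    short_words = [w for w in text.lower().split() if len(w) <= max_n]
    return sum(penalty for w in short_words if w not in language_grams)


# Hypothetical toy data, for illustration only.
swedish_toy = {"jag", "du", "vad"}
afrikaans_toy = {"ek", "is", "my"}

sv_penalty = short_word_penalty("Jag heter Karl", swedish_toy)
af_penalty = short_word_penalty("Jag heter Karl", afrikaans_toy)
```

In this reading the penalty would be added to whatever gram-based distance the guesser already computes, so a language that does not know the short words of the query moves further away.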


And I've been considering applying Markov chains to the grams in cases where it
is still hard to guess the language, such as Afrikaans vs. Dutch and
American vs. British English.

Let me know if you want a copy of my code.


Here is some test output:


[...]
As you can see, the single-word penalty on uni->quad does the trick on even the
smallest of text strings.

Well, perhaps it's also a matter of the quality of the language profiles. In one of my projects I'm using language profiles constructed from 1- to 5-grams, with a total of 300 grams per language profile. I don't do any additional tricks with penalizing the high-frequency words.
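
A rough Python sketch of such a profile: the 300 most frequent 1- to 5-grams of a training text. The scoring method is not stated in this message, so the sketch substitutes the rank-based "out-of-place" distance known from classic n-gram text categorization, normalised to the range 0..1 with lower meaning a closer match. The names `build_profile` and `out_of_place` are made up for the example.

```python
from collections import Counter


def build_profile(text, min_n=1, max_n=5, size=300):
    """Return the `size` most frequent character n-grams, most frequent first."""
    counts = Counter()
    for word in text.lower().split():
        for n in range(min_n, max_n + 1):
            for i in range(len(word) - n + 1):
                counts[word[i:i + n]] += 1
    return [gram for gram, _ in counts.most_common(size)]


def out_of_place(doc_profile, lang_profile):
    """Normalised rank distance in [0, 1]; lower means a closer match."""
    if not doc_profile or not lang_profile:
        return 1.0
    rank = {gram: i for i, gram in enumerate(lang_profile)}
    max_penalty = len(lang_profile)  # charged for grams the language lacks
    total = 0
    for i, gram in enumerate(doc_profile):
        if gram in rank:
            total += min(abs(i - rank[gram]), max_penalty)
        else:
            total += max_penalty
    return total / (max_penalty * len(doc_profile))
```

To guess a language, one would build a profile per language from a training corpus, build one for the input, and pick the language with the smallest distance.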


If I run the above example, I get the following:

 "jag heter kalle"
<input> - SV:   0.7197875
<input> - DN:   0.745925
<input> - NO:   0.747225
<input> - FI:   0.755475
<input> - NL:   0.7597125
<input> - EN:   0.76746875
<input> - FR:   0.77628125
<input> - GE:   0.7785125
<input> - IT:   0.796675
<input> - PL:   0.7984875
<input> - PT:   0.7995875
<input> - ES:   0.800775
<input> - RU:   0.88500625

However, for the text "vad heter du" (what's your name) the detection fails... :-)

A question: what was your source for the representative high-frequency words in various languages? Was it your training corpus or some publication?

--
Best regards,
Andrzej Bialecki

-------------------------------------------------
Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
-------------------------------------------------
FreeBSD developer (http://www.freebsd.org)

