[...]On Mon, 2 Feb 2004 20:10:57 +0100 "Jean-Francois Halleux" <[EMAIL PROTECTED]> wrote:
During the past few days, I've developed such a language guesser myself as a basis for a Lucene analyzer. It is based on trigrams. It is already working but not yet in a "publishable" state. If you or others are interested I can offer the sources.
I use a variable gram size because it is tough to detect the language of very small texts such as a query. For instance, when applying bi->quadgram to the Swedish sentence "Jag heter Karl" (my name is Karl), the text is presumed to be in Afrikaans. Using uni->quadgram does the trick.
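To make the variable gram size concrete, here is a minimal sketch of extracting character n-grams over a range of sizes. It is not the code offered in this thread: the function name `extract_grams`, the lowercasing, and the underscore padding at word boundaries are all assumptions made for illustration.

```python
from collections import Counter


def extract_grams(text, min_n=1, max_n=4):
    """Count character n-grams for every size from min_n to max_n.

    Hypothetical helper: lowercasing and underscore padding are
    assumptions, not details taken from the thread.
    """
    counts = Counter()
    for word in text.lower().split():
        # Pad the word so that word boundaries become part of the grams.
        padded = f"_{word}_"
        for n in range(min_n, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    return counts


# uni->quadgram versus bi->quadgram on the sentence from the message
uni_to_quad = extract_grams("Jag heter Karl", min_n=1, max_n=4)
bi_to_quad = extract_grams("Jag heter Karl", min_n=2, max_n=4)
```

The only difference between the two calls is whether single characters are counted, which is the extra evidence that uni->quadgram brings to a very short text.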
Also, I add penalties for gram-sized words found in the text but not in
the classified language. This improved my results even more.
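The message does not say how the penalty is computed, so the following is only one possible reading: every whole word short enough to fit in a single gram (at most `max_n` characters) that is unknown to the candidate language adds a fixed amount to that language's score. The names `word_penalty` and `known_short_words` and the penalty weight are hypothetical.

```python
def word_penalty(text, known_short_words, max_n=4, penalty=1.0):
    """Sum a fixed penalty for each gram-sized word missing from a language.

    Assumption: "gram-sized" means a word of at most max_n characters,
    and known_short_words is the set of such words seen in the
    candidate language's training data.
    """
    short_words = [w for w in text.lower().split() if len(w) <= max_n]
    missing = [w for w in short_words if w not in known_short_words]
    return penalty * len(missing)


# "heter" has five characters, so only "jag" and "karl" are considered.
example = word_penalty("Jag heter Karl", {"jag", "du", "vad"})
```

In this reading the penalty would be added to whatever distance the gram comparison produces, so a language that lacks the short words of the query is pushed down the ranking.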
And I've been considering applying Markov chains to the grams in cases where it is still hard to guess the language, such as Afrikaans vs. Dutch and American vs. British English.
Let me know if you want a copy of my code.
Here is some test output:
As you see, the single-word penalty on uni->quad does the trick on even the smallest of text strings.
Well, perhaps it's also a matter of the quality of the language profiles. In one of my projects I'm using language profiles constructed from 1- to 5-grams, with a total of 300 grams per language profile. I don't do any additional tricks with penalizing the high-frequency words.
If I run the above example, I get the following:
"jag heter kalle"
<input> - SV: 0.7197875
<input> - DN: 0.745925
<input> - NO: 0.747225
<input> - FI: 0.755475
<input> - NL: 0.7597125
<input> - EN: 0.76746875
<input> - FR: 0.77628125
<input> - GE: 0.7785125
<input> - IT: 0.796675
<input> - PL: 0.7984875
<input> - PT: 0.7995875
<input> - ES: 0.800775
<input> - RU: 0.88500625
However, for the text "vad heter du" (what's your name) the detection fails... :-)
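For readers who want to experiment with profiles of this shape (1- to 5-grams, the 300 most frequent grams per language), here is a rough sketch. The scoring function is not stated in this message, so the sketch substitutes a normalized rank-based "out-of-place" distance, in which a lower value means a closer match. All names (`ngrams`, `build_profile`, `distance`, `rank_languages`) are hypothetical, and the values it produces are not expected to reproduce the numbers above.

```python
from collections import Counter


def ngrams(text, min_n=1, max_n=5):
    """Count padded character n-grams of sizes min_n..max_n (assumed scheme)."""
    counts = Counter()
    for word in text.lower().split():
        padded = f"_{word}_"
        for n in range(min_n, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    return counts


def build_profile(text, size=300):
    """Return up to `size` grams, most frequent first."""
    return [gram for gram, _ in ngrams(text).most_common(size)]


def distance(doc_profile, lang_profile):
    """Normalized out-of-place distance in [0, 1]; lower means closer."""
    if not doc_profile or not lang_profile:
        return 1.0
    rank = {gram: i for i, gram in enumerate(lang_profile)}
    max_penalty = len(lang_profile)
    total = 0
    for i, gram in enumerate(doc_profile):
        if gram in rank:
            # Cap the rank difference so the result stays within [0, 1].
            total += min(abs(rank[gram] - i), max_penalty)
        else:
            # A gram the language profile has never seen gets the full penalty.
            total += max_penalty
    return total / (max_penalty * len(doc_profile))


def rank_languages(text, profiles):
    """Return (distance, language) pairs sorted from best to worst match."""
    doc = build_profile(text)
    return sorted((distance(doc, prof), lang) for lang, prof in profiles.items())
```

With real training corpora, `profiles` would map language codes such as "SV" or "EN" to the output of `build_profile`, and `rank_languages("jag heter kalle", profiles)` would return a list ordered like the one above.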
A question: what was your source for the representative high-frequency words in various languages? Was it your training corpus or some publication?
--
Best regards,
Andrzej Bialecki

-------------------------------------------------
Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
-------------------------------------------------
FreeBSD developer (http://www.freebsd.org)