On Mon, 2 Feb 2004 20:10:57 +0100 "Jean-Francois Halleux" <[EMAIL PROTECTED]> wrote:
> during the past days, I've developed such a language guesser myself
> as a basis for a Lucene analyzer. It is based on trigrams. It is
> already working but not yet in a "publishable" state. If you or others
> are interested I can offer the sources.

I use a variable gram size because it is hard to detect the language of
very small texts such as a query. For instance, with bi->quadgrams the
Swedish sentence "Jag heter Karl" (my name is Karl) is presumed to be
Afrikaans. Using uni->quadgrams does the trick.

Also, I add penalties for gram-sized words found in the text but not in
the classified language. This improved my results even more.

And I've been considering applying Markov chains on the grams where it
is still hard to guess the language, such as Afrikaans vs. Dutch and
American vs. British English.

Let me know if you want a copy of my code.

Here is some test output (the lowest weight is the best guess):

test = "jag heter kalle."

WITH SINGLE WORD PENALTIES:

uni->quad-gram
  test has a weight of 1600 in Swedish
  test has a weight of 1848 in Afrikaans
  test has a weight of 1928 in Dutch
  test has a weight of 2021 in Danish
  test has a weight of 2011 in Norwegian

bi->quad-gram
  test has a weight of 1024 in Swedish
  test has a weight of 1199 in Afrikaans
  test has a weight of 1356 in Dutch
  test has a weight of 1376 in Danish
  test has a weight of 1434 in Norwegian

tri-gram only
  test has a weight of 190 in Norwegian
  test has a weight of 212 in Afrikaans
  test has a weight of 221 in Swedish
  test has a weight of 236 in Danish
  test has a weight of 237 in Dutch

WITHOUT SINGLE WORD PENALTY:

uni->quad-gram
  test has a weight of 1448 in Afrikaans
  test has a weight of 1528 in Dutch
  test has a weight of 1600 in Swedish
  test has a weight of 1611 in Norwegian
  test has a weight of 1621 in Danish

bi->quad-gram
  test has a weight of 799 in Afrikaans
  test has a weight of 956 in Dutch
  test has a weight of 976 in Danish
  test has a weight of 1024 in Swedish
  test has a weight of 1034 in Norwegian

tri-gram only
  test has a weight of 190 in Norwegian
  test has a weight of 212 in Afrikaans
  test has a weight of 221 in Swedish
  test has a weight of 236 in Danish
  test has a weight of 237 in Dutch

As you can see, the single word penalty on uni->quad does the trick on
even the smallest of text strings.

karl
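
For readers who want to experiment with the two ideas described above
(a variable gram size and a penalty for gram-sized words), here is a
minimal Python sketch. It is not the code offered in this message. The
names `words`, `grams`, `LanguageProfile` and `word_penalty` are made up
for illustration, the two one-line training samples are toy data, and
the weight is a deliberately crude stand-in (a count of grams missing
from the language sample), so the numbers are not comparable to the
output above. As in that output, a lower weight is taken to mean a
better match.

```python
def words(text):
    """Yield lowercase words, keeping only the letters of each token."""
    for token in text.lower().split():
        word = "".join(ch for ch in token if ch.isalpha())
        if word:
            yield word


def grams(text, min_n=1, max_n=4):
    """Yield every character n-gram of every word for min_n <= n <= max_n."""
    for word in words(text):
        for n in range(min_n, max_n + 1):
            for i in range(len(word) - n + 1):
                yield word[i:i + n]


class LanguageProfile:
    """Toy profile: the grams and whole words seen in one language sample."""

    def __init__(self, sample, min_n=1, max_n=4):
        self.min_n, self.max_n = min_n, max_n
        self.known_grams = set(grams(sample, min_n, max_n))
        self.known_words = set(words(sample))

    def weight(self, text, word_penalty=0):
        """Lower is better. Assumed scoring, not the original formula."""
        # One point for every gram of the text the profile has never seen.
        misses = sum(
            1 for g in grams(text, self.min_n, self.max_n)
            if g not in self.known_grams
        )
        # Gram-sized words: whole words short enough to be a single gram.
        # Each one that is unknown to this language adds a penalty.
        unknown_short_words = sum(
            1 for w in words(text)
            if self.min_n <= len(w) <= self.max_n
            and w not in self.known_words
        )
        return misses + word_penalty * unknown_short_words


# Toy training samples; real profiles need far more text per language.
samples = {
    "Swedish": "jag heter karl och jag bor i sverige",
    "English": "my name is karl and i live in england",
}
profiles = {name: LanguageProfile(text) for name, text in samples.items()}

query = "jag heter kalle."
for name, profile in profiles.items():
    print(name, profile.weight(query, word_penalty=10))
```

Passing `min_n=3, max_n=3` to `LanguageProfile` gives the tri-gram only
variant, and `min_n=2` gives bi->quad, so the three setups can be
compared on the same query.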