I need to write (PHP) code to detect the language of a given block of
text. (Initially I want to distinguish between English, Japanese,
German, Simplified Chinese, Traditional Chinese, Arabic, Korean and
French.) I want it to be reliable, so my plan is to keep a list of
Unicode code points found only in each language [1], and use a match
there to return a high-confidence answer. If none are found, I would
fall back to a list of high-frequency words for each language [2] and
return a lower-confidence answer.
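
A minimal sketch of the first, high-confidence step, to make the idea
concrete. It assumes PHP 7.1+ with a PCRE build that has Unicode
property support, and UTF-8 input. The function name detect_by_script
and the language codes are hypothetical, chosen only for illustration.
It checks whole Unicode scripts rather than hand-made code point lists,
and it does not try to separate Simplified from Traditional Chinese.

```php
<?php
// Hypothetical helper: guess a language from script-specific characters.
// Returns a language code, or null when no distinguishing character is found.
function detect_by_script(string $text): ?string
{
    // Order matters: kana is tested before Han, because Japanese text
    // normally mixes kana with Han characters (kanji).
    $patterns = [
        'ja' => '/[\p{Hiragana}\p{Katakana}]/u',
        'ko' => '/\p{Hangul}/u',
        'ar' => '/\p{Arabic}/u',   // the Arabic script is shared by other languages
        'zh' => '/\p{Han}/u',      // Han without kana: treated as Chinese here
        'de' => '/\x{DF}/u',       // U+00DF, scharfes S
    ];

    foreach ($patterns as $lang => $pattern) {
        if (preg_match($pattern, $text) === 1) {
            return $lang;
        }
    }

    // Nothing distinctive found: the caller should fall back to word lists.
    return null;
}
```

A null result is the signal to move on to the lower-confidence word
check, e.g. detect_by_script("hello") returns null.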

Like most of my i18n-related PHP code, I'll release it as BSD-licensed
open source. But I wondered if something already exists that I could
build on. (Or comprehensive lists of Unicode code points used only in
certain languages; I have some small ad hoc lists, but the more I have,
the more useful the algorithm is.)

(I'm aware of letter-frequency techniques,
http://en.wikipedia.org/wiki/Letter_frequencies but I haven't worked out
where they would ever be more useful than word analysis.)

Darren

[1]: E.g. scharfes S (ß) for German, katakana/hiragana for Japanese (also,
http://en.wiktionary.org/wiki/Category:Japanese-only_CJKV_Characters ).
Arabic and Korean also have unique alphabets. Accents for French.

[2]: E.g. for English "the", "be", "to", etc.
http://en.wikipedia.org/wiki/Most_common_words_in_English
Same list for German:
http://de.wikipedia.org/wiki/Liste_der_h%C3%A4ufigsten_W%C3%B6rter_der_deutschen_Sprache
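
A rough sketch of this word-list fallback, assuming PHP 7.1+ with the
mbstring extension and UTF-8 input. The function name
detect_by_common_words is hypothetical, and the three word lists are
tiny illustrative samples rather than the full lists linked above.

```php
<?php
// Hypothetical helper: score text against short high-frequency word lists.
// Returns a language code, or null when there is no clear winner.
function detect_by_common_words(string $text): ?string
{
    // Illustrative samples only; real lists would be much longer.
    $commonWords = [
        'en' => ['the', 'be', 'to', 'of', 'and', 'in', 'that', 'have'],
        'de' => ['der', 'die', 'und', 'in', 'den', 'von', 'zu', 'das'],
        'fr' => ['le', 'de', 'un', 'et', 'les', 'des', 'est', 'que'],
    ];

    // Lower-case the text, then split on anything that is not a letter.
    $words = preg_split(
        '/[^\p{L}]+/u',
        mb_strtolower($text, 'UTF-8'),
        -1,
        PREG_SPLIT_NO_EMPTY
    );
    if ($words === false || $words === []) {
        return null;
    }

    // Score = number of tokens in the text that appear in each list.
    $scores = [];
    foreach ($commonWords as $lang => $list) {
        $scores[$lang] = count(array_intersect($words, $list));
    }

    // Highest score first; refuse to answer on zero hits or a tie.
    arsort($scores);
    $langs  = array_keys($scores);
    $values = array_values($scores);
    if ($values[0] === 0 || $values[0] === $values[1]) {
        return null;
    }
    return $langs[0];
}
```

Returning null on a tie keeps the answer honest for very short input,
where a single shared word such as "in" matches more than one list.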


-- 
Darren Cook, Software Researcher/Developer
http://dcook.org/mlsn/ (English-Japanese-German-Chinese-Arabic
                        open source dictionary/semantic network)
http://dcook.org/work/ (About me and my work)
http://darrendev.blogspot.com/ (blog on php, flash, i18n, linux, ...)

-- 
PHP Unicode & I18N Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php
