I need to write PHP code to detect the language of a given block of text. Initially I want to distinguish between English, Japanese, German, Simplified Chinese, Traditional Chinese, Arabic, Korean and French. I want it to be reliable, so my plan is to keep a list of Unicode code points found only in each given language [1] and use that to return a high-confidence answer. If none are found, I would fall back to a list of high-frequency words for each language [2] and use that to return a lower-confidence answer.
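The two-stage plan above could be sketched roughly as follows. This is only a minimal illustration, not an existing library: the function names (`detectByScript`, `detectByCommonWords`, `detectLanguage`), the language codes and the tiny word lists are all hypothetical placeholders, and it assumes UTF-8 input, PCRE built with Unicode support and the mbstring extension. It does not attempt to separate Simplified from Traditional Chinese, which would need the per-language character lists asked about here.

```php
<?php
// Sketch only: all names and data below are illustrative assumptions.

/** Stage 1: characters assumed unique to one of the target languages. */
function detectByScript(string $text): ?string
{
    // "Unique" only holds among the eight languages listed in this post;
    // e.g. Arabic script is also used by languages outside that set.
    $patterns = [
        'ja' => '/[\p{Hiragana}\p{Katakana}]/u',
        'ko' => '/\p{Hangul}/u',
        'ar' => '/\p{Arabic}/u',
        'de' => '/ß/u',
    ];
    foreach ($patterns as $lang => $pattern) {
        if (preg_match($pattern, $text) === 1) {
            return $lang;
        }
    }
    return null;
}

/** Stage 2: count hits against short lists of high-frequency words. */
function detectByCommonWords(string $text): ?string
{
    // Tiny sample lists; real lists would be much longer.
    $common = [
        'en' => ['the', 'be', 'to', 'of', 'and', 'in', 'that', 'have'],
        'de' => ['der', 'die', 'und', 'in', 'den', 'von', 'zu', 'das'],
        'fr' => ['le', 'la', 'les', 'de', 'et', 'un', 'une', 'est'],
    ];
    // Split on anything that is not a Unicode letter.
    $words = preg_split('/[^\p{L}]+/u', mb_strtolower($text, 'UTF-8'), -1, PREG_SPLIT_NO_EMPTY);

    $best = null;
    $bestScore = 0;
    foreach ($common as $lang => $list) {
        // array_intersect keeps duplicates from $words, so repeats count.
        $score = count(array_intersect($words, $list));
        if ($score > $bestScore) {
            $best = $lang;
            $bestScore = $score;
        }
    }
    return $best;
}

/** Combine both stages and report which one produced the answer. */
function detectLanguage(string $text): array
{
    $lang = detectByScript($text);
    if ($lang !== null) {
        return ['lang' => $lang, 'confidence' => 'high'];
    }
    $lang = detectByCommonWords($text);
    if ($lang !== null) {
        return ['lang' => $lang, 'confidence' => 'low'];
    }
    return ['lang' => null, 'confidence' => 'none'];
}
```

For example, `detectLanguage('Die Straße ist lang')` would be answered by the first stage because of the ß, while plain English or French text with no distinctive characters falls through to the word lists and comes back with the lower confidence label.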
Like most of my i18n-related PHP code, I'll release this as open source under a BSD license. But I wondered whether something already exists that I could build on, or whether there are comprehensive lists of Unicode code points used only in certain languages. I have some small ad hoc lists, but the more I have, the more useful the algorithm is.

(I'm aware of letter-frequency techniques, http://en.wikipedia.org/wiki/Letter_frequencies, but haven't worked out where they would ever be more useful than word analysis.)

Darren

[1]: E.g. scharfes S (ß) for German, katakana/hiragana for Japanese (also http://en.wiktionary.org/wiki/Category:Japanese-only_CJKV_Characters). Arabic and Korean also have unique alphabets. Accents for French.

[2]: E.g. for English "the", "be", "to", etc.: http://en.wikipedia.org/wiki/Most_common_words_in_English
Same list for German: http://de.wikipedia.org/wiki/Liste_der_h%C3%A4ufigsten_W%C3%B6rter_der_deutschen_Sprache

--
Darren Cook, Software Researcher/Developer
http://dcook.org/mlsn/ (English-Japanese-German-Chinese-Arabic open source dictionary/semantic network)
http://dcook.org/work/ (About me and my work)
http://darrendev.blogspot.com/ (blog on php, flash, i18n, linux, ...)

--
PHP Unicode & I18N Mailing List (http://www.php.net/)