On Aug 1, 2005, at 12:25 PM, Andy Liu wrote:

The Nutch language identifier plugin currently doesn't handle
CJKV pages.  Does anybody here have any experience with automatically
detecting the language of such pages?

I know there are specific encodings which give away what language the
page is, but for Asian language pages that use Unicode or its
variants, I'm out of luck.

For Unicode it's pretty easy: just look for characters that give away the language, for example Hiragana or Katakana for Japanese, Hangul for Korean, etc.
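
As a rough illustration of the character-based check, here is a minimal Java sketch built on the JDK's Character.UnicodeScript. The class name, the "unknown" return value, and the tie-breaking rule (more Hangul than kana means Korean, any kana means Japanese, Han alone is assumed to be Chinese) are my own assumptions, not anything from the Nutch plugin. Vietnamese in Latin script won't be caught this way.

```java
// Hypothetical helper: guesses a language code from the Unicode scripts in a string.
final class ScriptLanguageGuess {
    private ScriptLanguageGuess() {}

    /** Returns "ja", "ko", "zh", or "unknown" based on script counts. */
    static String guess(String text) {
        int kana = 0, hangul = 0, han = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);   // step over surrogate pairs correctly
            switch (Character.UnicodeScript.of(cp)) {
                case HIRAGANA:
                case KATAKANA:
                    kana++;
                    break;
                case HANGUL:
                    hangul++;
                    break;
                case HAN:
                    han++;
                    break;
                default:
                    break;                  // Latin, digits, punctuation, etc.
            }
        }
        // Assumed heuristic: Hangul and kana are language-specific, Han is shared.
        if (hangul > 0 && hangul > kana) return "ko";
        if (kana > 0) return "ja";
        if (han > 0) return "zh";           // Han only: assume Chinese
        return "unknown";
    }
}
```

A short page consisting only of Han characters (a Japanese title written entirely in kanji, say) would be reported as "zh" here, so in practice you'd want to count over the whole page body rather than a snippet.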

Or you can derive the language from the host URL, if it includes a country code.
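
The host-based hint could look something like the sketch below. The class name and the country-code-to-language table are hypothetical and deliberately tiny; a country code is only a hint (plenty of .jp sites serve English, and .com sites serve anything), so treat the result as a prior rather than an answer.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper: maps the country-code TLD of a URL's host to a language code.
final class HostLanguageHint {
    private static final Map<String, String> TLD_TO_LANG = new HashMap<>();
    static {
        // Assumed mapping for illustration only; extend as needed.
        TLD_TO_LANG.put("jp", "ja");
        TLD_TO_LANG.put("kr", "ko");
        TLD_TO_LANG.put("cn", "zh");
        TLD_TO_LANG.put("tw", "zh");
        TLD_TO_LANG.put("hk", "zh");
        TLD_TO_LANG.put("vn", "vi");
    }

    private HostLanguageHint() {}

    /** Returns a language code for the URL's country-code TLD, or "unknown". */
    static String hint(String url) {
        String host;
        try {
            host = URI.create(url).getHost();
        } catch (IllegalArgumentException e) {
            return "unknown";               // not a parseable URI
        }
        if (host == null) return "unknown"; // e.g. relative URL, no authority
        int dot = host.lastIndexOf('.');
        if (dot < 0 || dot == host.length() - 1) return "unknown";
        String tld = host.substring(dot + 1).toLowerCase(Locale.ROOT);
        return TLD_TO_LANG.getOrDefault(tld, "unknown");
    }
}
```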

It's hard to detect all the various encodings (EUC-JP, Shift_JIS, ISO-2022-KR/JP, Big5, etc.), and many servers do not correctly identify the encoding they are sending.

See the latest release of ICU (3.4), which now supports charset detection.

-- Ken
--
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-470-9200
