On Aug 1, 2005, at 12:25 PM, Andy Liu wrote:
The Nutch language identifier plugin currently doesn't handle
CJKV pages. Does anybody here have any experience with automatically
detecting the language of such pages?
I know there are specific encodings which give away what language the
page is in, but for Asian language pages that use Unicode or its
variants, I'm out of luck.
For Unicode it's pretty easy... just look for characters that give
away the language... for example, Hiragana for Japanese, Hangul for
Korean, etc.
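
A minimal sketch of that idea in Java, assuming the page text has already been decoded to a String. The class and method names are made up for illustration and are not part of Nutch. It counts code points per Unicode script using the JDK's Character.UnicodeScript. Treating Han-only text as Chinese is an assumption: a short Japanese snippet written entirely in kanji would be misclassified, and Vietnamese in Latin script is not covered at all.

```java
import java.lang.Character.UnicodeScript;

public class CjkScriptGuess {
    /**
     * Guesses a language code from the scripts present in the text.
     * Returns "ja", "ko", "zh", or "unknown". Hypothetical helper, not Nutch API.
     */
    public static String guessLanguage(String text) {
        int kana = 0, hangul = 0, han = 0;
        // Walk by code point so supplementary characters are handled correctly.
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            UnicodeScript script = UnicodeScript.of(cp);
            if (script == UnicodeScript.HIRAGANA || script == UnicodeScript.KATAKANA) {
                kana++;
            } else if (script == UnicodeScript.HANGUL) {
                hangul++;
            } else if (script == UnicodeScript.HAN) {
                han++;
            }
        }
        // Kana or Hangul give the language away even when Han characters dominate.
        if (kana > 0 && kana >= hangul) return "ja";
        if (hangul > 0) return "ko";
        // Assumption: Han with no kana or Hangul is treated as Chinese.
        if (han > 0) return "zh";
        return "unknown";
    }
}
```

In practice you would probably want a minimum count or ratio before trusting the result, since a page in one language can quote a few characters of another.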
Or you can derive the language from the host URL, if it includes a
country code.
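
A rough sketch of the host-based hint, again with made-up names. The mapping table is a deliberately tiny, hypothetical example, and a country code is only a weak hint (plenty of .jp sites serve English pages), so this is better used as a tie-breaker than as the answer.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

public class HostLanguageHint {
    // Hypothetical, incomplete mapping from country-code TLD to language code.
    private static final Map<String, String> TLD_TO_LANG = new HashMap<>();
    static {
        TLD_TO_LANG.put("jp", "ja");
        TLD_TO_LANG.put("kr", "ko");
        TLD_TO_LANG.put("cn", "zh");
        TLD_TO_LANG.put("tw", "zh");
        TLD_TO_LANG.put("vn", "vi");
    }

    /** Returns a language hint for the URL's host, or null if there is none. */
    public static String hintFromUrl(String url) {
        String host;
        try {
            host = URI.create(url).getHost();
        } catch (IllegalArgumentException e) {
            return null; // not a parseable URL
        }
        if (host == null) return null;
        int dot = host.lastIndexOf('.');
        if (dot < 0 || dot == host.length() - 1) return null;
        // The last label of the host is the top-level domain.
        String tld = host.substring(dot + 1).toLowerCase(Locale.ROOT);
        return TLD_TO_LANG.get(tld);
    }
}
```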
It's hard to detect all the various encodings (EUC-JP, Shift_JIS,
ISO-2022-KR/JP, Big5, etc.), and many servers do not correctly
identify the encodings.
See the latest release of ICU (3.4), which now supports charset detection.
-- Ken
--
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-470-9200