On Aug 1, 2005, at 12:25 PM, Andy Liu wrote:

The Nutch language identifier plugin currently doesn't handle
CJKV pages.  Does anybody here have any experience with automatically
detecting the language of such pages?

I know there are specific encodings which give away what language the
page is, but for Asian language pages that use Unicode or its
variants, I'm out of luck.

For Unicode it's pretty easy: just look for characters that give away the language, for example Hiragana for Japanese, Hangul for Korean, etc.
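
A minimal sketch of that idea, assuming Java 7 or later (for Character.UnicodeScript) and text that has already been decoded to a String. The class name ScriptGuess and the returned language codes are hypothetical, and treating Han-only text as Chinese is an assumption that can be wrong for Japanese text with no kana:

```java
import java.lang.Character.UnicodeScript;

// Hypothetical helper, not part of Nutch: guesses a CJK language from
// the Unicode scripts that appear in already-decoded text.
final class ScriptGuess {

    /** Returns "ja", "ko", "zh", or null when no CJK script is seen. */
    static String guess(String text) {
        boolean sawHan = false;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            UnicodeScript script = UnicodeScript.of(cp);
            if (script == UnicodeScript.HIRAGANA
                    || script == UnicodeScript.KATAKANA) {
                return "ja"; // kana gives away Japanese
            }
            if (script == UnicodeScript.HANGUL) {
                return "ko"; // Hangul gives away Korean
            }
            if (script == UnicodeScript.HAN) {
                sawHan = true; // shared by Chinese and Japanese; keep scanning
            }
        }
        // Assumption: Han characters with no kana or Hangul are treated as Chinese.
        return sawHan ? "zh" : null;
    }
}
```

This returns on the first kana or Hangul character, so a Chinese page that quotes a little Japanese would be misclassified. Counting characters per script and picking the dominant one would probably be more robust.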

Or you can derive the language from the host URL, if it includes a country code.
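
A rough sketch of the host-based hint, using only the standard library. The class name HostLanguageHint and the small TLD-to-language table are hypothetical and purely illustrative. A country code is only a hint, since many sites under a given country code serve other languages:

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of Nutch: maps a country-code TLD in the
// URL host to a likely language.
final class HostLanguageHint {

    // Illustrative mapping only; extend or correct as needed.
    private static final Map<String, String> TLD_TO_LANG = new HashMap<>();
    static {
        TLD_TO_LANG.put("jp", "ja");
        TLD_TO_LANG.put("kr", "ko");
        TLD_TO_LANG.put("cn", "zh");
        TLD_TO_LANG.put("tw", "zh");
        TLD_TO_LANG.put("vn", "vi");
    }

    /** Returns a language code hint, or null if the host gives no hint. */
    static String guess(String url) {
        String host;
        try {
            host = URI.create(url).getHost();
        } catch (IllegalArgumentException e) {
            return null; // malformed URL
        }
        if (host == null) {
            return null;
        }
        int dot = host.lastIndexOf('.');
        if (dot < 0) {
            return null; // no TLD at all, e.g. "localhost"
        }
        String tld = host.substring(dot + 1).toLowerCase(Locale.ROOT);
        return TLD_TO_LANG.get(tld);
    }
}
```

For example, HostLanguageHint.guess("http://www.example.co.jp/index.html") yields "ja", while a .com host yields null and would need one of the other approaches.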

It's hard to detect all the various encodings (EUC-JP, Shift_JIS, ISO-2022-KR/JP, Big5, etc.), and many servers do not correctly identify the encoding they are sending.

See the latest release of ICU (3.4), which now supports charset detection.

-- Ken
--
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-470-9200


_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
