[Nutch-dev] Re: Detecting CJKV / Asian language pages

Ken Krugler Tue, 02 Aug 2005 09:04:10 -0700

On Aug 1, 2005, at 5:31 PM, Ken Krugler wrote:
Or you can derive the language from the host URL, if it includes a country code.
That's not really sufficient... many Japanese sites also have pages in English. Actually, that's true for most non-English sites from what I've seen.

Yes - this is just a last-gasp fallback, in case you're forced to guess. Statistically it will be better than always picking en :)
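
A minimal sketch of that last-gasp fallback, in Python for brevity (Nutch itself is Java). The mapping table and the function name guess_language_from_host are hypothetical and not part of Nutch; the table is deliberately tiny, and as noted above a country code says nothing certain about the language of any individual page.

```python
from urllib.parse import urlsplit

# Hypothetical, deliberately incomplete mapping from country-code TLD
# to a language guess. A real table would need far more entries.
CCTLD_LANGUAGE = {
    "jp": "ja",
    "kr": "ko",
    "cn": "zh",
    "tw": "zh",
    "vn": "vi",
}


def guess_language_from_host(url, default="en"):
    """Guess a language from the host's country-code TLD, else return default."""
    host = (urlsplit(url).hostname or "").rstrip(".")
    tld = host.rsplit(".", 1)[-1]
    return CCTLD_LANGUAGE.get(tld, default)
```

With this table, guess_language_from_host("http://www.example.co.jp/index.html") gives "ja", while a .com host falls through to the default. It only makes sense as the final step, after declared and detected encodings or languages have failed.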

It's hard to detect all the various encodings... EUC-JP, SHIFT-JIS, ISO-2022-KR/JP, BIG5, etc. and many servers do not correctly identify the encodings.
See the latest release of ICU (3.4), which now supports charset detection.
Yes, I forgot about that... but even then I wonder how well it will do. For largish blocks of text (1k or so) it's not bad... you can use statistical modelling to give you accurate probabilities, but for smallish blocks (e.g. query strings) you have a much harder time.

Yes - small chunks of untagged text are going to be a problem, no matter what you do. But if you're referring to query strings from an HTML page, the default is to use the encoding of the page (which from Nutch defaults to UTF-8). And you can use the accept-charset form attribute to explicitly specify UTF-8.
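
To illustrate why small chunks are hard, here is a rough Python sketch. It is not ICU's detector and does no statistical modelling; it only checks which candidate encodings can decode a byte string without error, using codecs from the Python standard library. The candidate list and the name plausible_encodings are assumptions made for this example.

```python
# Candidate encodings to try, using Python's standard codec names.
CANDIDATES = ["utf-8", "euc_jp", "shift_jis", "iso2022_jp", "iso2022_kr", "big5"]


def plausible_encodings(data, candidates=CANDIDATES):
    """Return the candidates that decode the bytes without error.

    This is only a validity check. It can rule encodings out, but it
    cannot rank the survivors the way a statistical detector can.
    """
    survivors = []
    for name in candidates:
        try:
            data.decode(name)  # strict error handling is the default
        except UnicodeDecodeError:
            continue
        survivors.append(name)
    return survivors


# A short ASCII-only query string is valid in several encodings at once,
# so the bytes alone cannot settle the question.
ambiguous = plausible_encodings(b"q=hello")

# Japanese text encoded as EUC-JP is not valid UTF-8, so at least
# that candidate gets ruled out.
narrowed = plausible_encodings("日本語".encode("euc_jp"))
```

A check like this narrows the field for longer 8-bit text but leaves short strings ambiguous, which is why having the form submit in a known encoding such as UTF-8 is the more dependable route for query strings.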


-- Ken
--
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-470-9200


