On Aug 1, 2005, at 5:31 PM, Ken Krugler wrote:

Or you can derive the language from the host URL, if it includes a country code.

That's not really sufficient... many Japanese sites also have pages in English. Actually, that's true for most non-English sites from what I've seen.
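To make the country-code idea concrete, here is a minimal sketch in Python. The function name `language_hint_from_url` and the small `CCTLD_LANGUAGE_HINTS` table are hypothetical, made up for illustration; neither comes from Nutch or any library. As noted above, the result is only a hint, since a .jp host can easily serve English pages.

```python
from urllib.parse import urlsplit

# Hypothetical, deliberately tiny mapping from country-code TLD to a
# language guess. A real table would be larger and still only a hint.
CCTLD_LANGUAGE_HINTS = {
    "jp": "ja",
    "kr": "ko",
    "tw": "zh",
    "de": "de",
    "fr": "fr",
}


def language_hint_from_url(url):
    """Return a language guess from the host's country-code TLD, or None."""
    host = urlsplit(url).hostname  # lower-cased; None if the URL has no host
    if not host or "." not in host:
        return None
    # Take the last label of the host name, ignoring a trailing root dot.
    tld = host.rstrip(".").rsplit(".", 1)[-1]
    return CCTLD_LANGUAGE_HINTS.get(tld)
```

A caller would treat the return value as a prior to be confirmed or overridden by looking at the page content itself, not as the final answer.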

It's hard to detect all the various encodings... EUC-JP, SHIFT-JIS, ISO-2022-KR/JP, BIG5, etc. and many servers do not correctly identify the encodings.


See the latest release of ICU (3.4), which now supports charset detection.

Yes, I forgot about that... but even then I wonder how well it will do. For largish blocks of text (1k or so) it's not bad... you can use statistical modelling to give you accurate probabilities, but for smallish blocks (e.g. query strings) you have a much harder time.
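One way to see why small inputs are harder is the Python sketch below. It is not ICU's detector and it does no statistical modelling; it only checks which candidate encodings can decode the bytes without error, using Python's built-in codecs. The function name `plausible_encodings` and the candidate list are assumptions chosen for illustration. A short, mostly ASCII query string usually decodes cleanly under several of these encodings, so validity alone cannot pick one.

```python
# Candidate codecs, named as Python's standard library spells them.
CANDIDATES = ["euc_jp", "shift_jis", "iso2022_jp", "iso2022_kr", "big5"]


def plausible_encodings(data, candidates=CANDIDATES):
    """Return the candidate encodings under which `data` decodes without error.

    This is a validity check only: it says which encodings are possible,
    not which one is most likely.
    """
    plausible = []
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue  # these bytes are not legal in this encoding
        plausible.append(name)
    return plausible
```

Calling `plausible_encodings(b"q=nutch")` returns more than one name, which is the ambiguity in question. A longer block of real text tends to rule out more candidates, and gives a statistical detector enough material to rank whatever remains.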



_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers