[Nutch-dev] Re: Detecting CJKV / Asian language pages

Gavin Thomas Nicol Tue, 02 Aug 2005 12:45:36 -0700


On Aug 2, 2005, at 11:55 AM, Ken Krugler wrote:

Yes - small chunks of untagged text are going to be a problem, no matter what you do. But if you're referring to query strings from an HTML page, the default is to use the encoding of the page (which from Nutch defaults to UTF-8). And you can use the accept-charset form attribute to explicitly specify UTF-8.

Yes, that's right (FWIW, I'm one of the authors of RFC 2070)... it'd be interesting to see how well ICU does with crawl data. Does anyone have any experience?
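
To make the query-string point concrete, here is a minimal sketch of what happens when a form is submitted as UTF-8 and the receiving side decodes it with the right and the wrong charset. It uses only java.net.URLEncoder and java.net.URLDecoder; the class name, method names, and sample term are hypothetical and not taken from Nutch.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

// Hypothetical illustration; not part of Nutch.
public class QueryCharsetSketch {

    // Percent-encode a form value the way a browser would when the
    // page (or the form's accept-charset) says the given charset.
    static String encodeQuery(String value, String charset) {
        try {
            return URLEncoder.encode(value, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charset, e);
        }
    }

    // Decode a raw query value, assuming it was sent in the given charset.
    static String decodeQuery(String raw, String charset) {
        try {
            return URLDecoder.decode(raw, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charset, e);
        }
    }

    public static void main(String[] args) {
        // Sample CJK query term, written with escapes so the source
        // file encoding does not matter.
        String term = "\u65E5\u672C\u8A9E";

        String wire = encodeQuery(term, "UTF-8");
        System.out.println(wire); // %E6%97%A5%E6%9C%AC%E8%AA%9E

        // Decoding with the charset the page declared round-trips cleanly.
        System.out.println(decodeQuery(wire, "UTF-8").equals(term)); // true

        // Decoding with a wrong guess does not recover the original text.
        System.out.println(decodeQuery(wire, "ISO-8859-1").equals(term)); // false
    }
}
```

The wire form carries no charset label, which is why the page encoding (or an explicit accept-charset) matters: the receiver has to know or guess it.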



