On Aug 2, 2005, at 11:55 AM, Ken Krugler wrote:

Yes - small chunks of untagged text are going to be a problem, no matter what you do. But if you're referring to query strings from an HTML page, the default is to use the encoding of the page (which from Nutch defaults to UTF-8). And you can use the accept-charset form attribute to explicitly specify UTF-8.
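
To make the query-string point concrete, here is a minimal Java sketch of how the same percent-encoded bytes decode differently depending on the charset assumed for the page. The class and method names (QueryDecoding, decode) are hypothetical and only for illustration; only java.net.URLDecoder is a real API here.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

final class QueryDecoding {
    /**
     * Decodes one percent-encoded query value using the charset the
     * submitting page is assumed to have used (hypothetical helper).
     */
    static String decode(String rawValue, String pageCharset) {
        try {
            return URLDecoder.decode(rawValue, pageCharset);
        } catch (UnsupportedEncodingException e) {
            // The charset name was not recognized by this JVM.
            throw new IllegalArgumentException("Unsupported charset: " + pageCharset, e);
        }
    }
}
```

With this sketch, decode("caf%C3%A9", "UTF-8") yields "café", while decoding the same bytes as ISO-8859-1 yields the two-character mojibake "Ã©" in place of the "é". That is why pinning the form to UTF-8 with accept-charset helps.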

Yes, that's right (FWIW, I'm one of the authors of RFC 2040)...

Thanks for your work in this area! I assume it's RFC 2070 :)

It'd be interesting to see how well ICU does with crawl data. Does anyone have any experience?

ICU 3.4 was just released, so I don't think there's any real-world data yet. Mozilla's charset detector has been around for a while, and I haven't heard people complaining loudly about it (other than issues with trying to extract it for use in other apps), but I don't monitor those mailing lists.

Maybe Otis would be a good person to give this a try, based on his email to the list on 7/17/2005.

He also listed a number of charset names that he was getting back from servers, many of which weren't valid IANA names. So there are at least three kinds of charset problems:

1. Server doesn't provide any charset info.
2. Server provides incorrect charset info.
        a. Charset is a subset (e.g. 8859-1 vs. 1252)
        b. Charset is just plain wrong (e.g. 8859-1 vs. 1251)
3. Server provides an invalid charset name.
        a. Charset could be mapped, with a table (e.g. ".UTF8")
        b. Charset is unknown (e.g. "X-USER-DEFINED").
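
As a rough illustration of how a crawler might triage these cases, here is a minimal Java sketch. It is not Nutch's actual code: the class name CharsetResolver, the method resolve, and both tables are hypothetical, and the table entries are just the examples from the list above. It handles 2a and 3a with lookup tables and returns null for 1 and 3b, meaning "fall back to a detector or a default". Case 2b cannot be caught from the name alone and would need content-based detection.

```java
import java.nio.charset.Charset;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

final class CharsetResolver {
    // Hypothetical alias table for invalid but recognizable names (case 3a).
    private static final Map<String, String> ALIASES = new HashMap<String, String>();
    // Hypothetical table promoting a declared subset to its superset (case 2a).
    private static final Map<String, String> SUPERSETS = new HashMap<String, String>();

    static {
        ALIASES.put(".UTF8", "UTF-8");
        SUPERSETS.put("ISO-8859-1", "windows-1252");
    }

    /** Returns a usable Charset, or null when the caller must detect or default. */
    static Charset resolve(String declared) {
        if (declared == null || declared.trim().isEmpty()) {
            return null; // case 1: no charset info at all
        }
        String name = declared.trim().toUpperCase(Locale.ROOT);
        String alias = ALIASES.get(name);
        if (alias != null) {
            name = alias; // case 3a: map the invalid name to a valid one
        }
        Charset cs;
        try {
            if (!Charset.isSupported(name)) {
                return null; // case 3b: legal syntax, but unknown to this JVM
            }
            cs = Charset.forName(name);
        } catch (IllegalArgumentException e) {
            return null; // case 3b: name is not even syntactically legal
        }
        // Case 2a: decode with the superset if this JVM has it.
        String superset = SUPERSETS.get(cs.name().toUpperCase(Locale.ROOT));
        if (superset != null && Charset.isSupported(superset)) {
            cs = Charset.forName(superset);
        }
        return cs;
    }
}
```

Decoding 8859-1 pages as 1252 is generally safe because the two differ only in the 0x80-0x9F range, which 8859-1 reserves for control characters that rarely appear in real pages.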

There are other issues with pages, for example ones that use some kind of font hack to display "Latin" text in a specialty script. Tibetan is a good example of this: typically the page encoding is specified as 8859-1 or 1252, but when you use the appropriate font it displays Tibetan glyphs.

-- Ken
--
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-470-9200


_______________________________________________
Nutch-developers mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-developers
