On Aug 2, 2005, at 11:55 AM, Ken Krugler wrote:
Yes - small chunks of untagged text are going to be a problem, no
matter what you do. But if you're referring to query strings from
an HTML page, the default is to use the encoding of the page (which
from Nutch defaults to UTF-8). And you can use the accept-charset
form attribute to explicitly specify UTF-8.
Yes, that's right (FWIW, I'm one of the authors of RFC 2040)...
Thanks for your work in this area! I assume it's RFC 2070 :)
It'd be interesting to see how well ICU does with crawl data. Does
anyone have any experience?
ICU 3.4 was just released, so I don't think there's any real-world
data yet. Mozilla's charset detector has been around for a while, and
I haven't heard people complaining loudly about it (other than issues
with trying to extract it for use in other apps), but I don't monitor
those mailing lists.
Maybe Otis would be a good person to give this a try, based on his
email to the list on 7/17/2005.
He also listed a number of charset names that he was getting back
from servers, many of which weren't valid IANA names. So there are at
least three kinds of charset problems:
1. Server doesn't provide any charset info.
2. Server provides incorrect charset info.
   a. Charset is a subset (e.g. 8859-1 vs. 1252)
   b. Charset is just plain wrong (e.g. 8859-1 vs. 1251)
3. Server provides an invalid charset name.
   a. Charset could be mapped, with a table (e.g. ".UTF8")
   b. Charset is unknown (e.g. "X-USER-DEFINED").
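
As a rough illustration of cases 1, 2a, 3a and 3b above, here is a minimal Python sketch of how a crawler might clean up a server-supplied charset label before decoding. It is not Nutch code: the function name resolve_charset and the contents of the ALIASES and SUPERSETS tables are assumptions made up for this example, and a real table would have to be built from the labels actually seen in crawl data. Case 2b (a label that is valid but simply wrong) can't be caught this way and would need a content-based detector.

```python
import codecs

# Hypothetical alias table for labels that are not valid IANA names (case 3a).
# The entries here are illustrative only.
ALIASES = {
    ".utf8": "utf-8",
}

# Hypothetical table of labels that are often a subset of what the server
# really sends (case 2a): decode declared 8859-1 as windows-1252 instead.
SUPERSETS = {
    codecs.lookup("iso-8859-1").name: "cp1252",
}


def resolve_charset(label):
    """Return a Python codec name for a server-supplied label, or None.

    None means the caller has to fall back to detection or a default:
    either no label was given (case 1) or it is unknown (case 3b).
    """
    if not label:
        return None
    cleaned = label.strip().strip("\"'").lower()
    # Map known-bad names onto real ones before asking the codec registry.
    cleaned = ALIASES.get(cleaned, cleaned)
    try:
        name = codecs.lookup(cleaned).name
    except LookupError:
        return None
    # Widen subset charsets to their common superset.
    return SUPERSETS.get(name, name)


if __name__ == "__main__":
    for label in (None, "ISO-8859-1", ".UTF8", "X-USER-DEFINED"):
        print(repr(label), "->", resolve_charset(label))
```

Doing the alias lookup before the registry lookup means the table can also override names the registry would accept, which may or may not be what you want.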
There are other issues w/pages, for example ones that use some kind
of font hack to display "Latin" text in a specialty script - Tibetan
is a good example of this. Typically the page encoding is specified
as 8859-1 or 1252 but when you use the appropriate font it displays
Tibetan glyphs.
-- Ken
--
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-470-9200