Thanks for your reply.

I have found that the method you mentioned looks at the HTTP header from the
web server.  It looks for "charset" and does the mapping.  The Apache web
server that hosts the document is already configured with:

AddDefaultCharset Big5-HKSCS

The crawl engine does treat the encoding of all pages from the web server as
Big5-HKSCS.
But the crawl engine also looks at the meta tag of the HTML page.
I have two identical HTML pages with Hong Kong Big5 characters. One has the
tag

<meta http-equiv="Content-Type" content="text/html; charset=Big5" />

The other

<meta http-equiv="Content-Type" content="text/html; charset=Big5-HKSCS" />

When both of these HTML pages appear on the search result page, the "summary"
of the first one contains unreadable characters.
So I think I need to modify the code that reads the meta tag of the HTML page.
Do you have any ideas?

From a quick look at the source, the meta tag handling eventually also calls StringUtil.resolveEncodingAlias().

HtmlParser.getParse() calls StringUtil.parseCharacterEncoding(), passing it the content-type meta data, and then takes the returned charset name and calls StringUtil.resolveEncodingAlias().

So if you fix StringUtil.resolveEncodingAlias(), I think it will take care of both issues (HTTP server and HTML pages).
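
To make the first step of that chain concrete, here is a minimal sketch of pulling the charset name out of a content-type value. This is not the Nutch source: the class name CharsetParamSketch, the method parseCharset(), and the regular expression are all assumptions made for illustration. In Nutch the returned name would then be handed to StringUtil.resolveEncodingAlias().

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not the actual Nutch implementation.
public class CharsetParamSketch {

    // Matches "charset=<name>", allowing optional spaces and quotes around the name.
    private static final Pattern CHARSET = Pattern.compile(
            "charset\\s*=\\s*[\"']?([^\\s;\"']+)", Pattern.CASE_INSENSITIVE);

    // Returns the charset name found in a content-type value, or null if none is present.
    public static String parseCharset(String contentType) {
        if (contentType == null) {
            return null;
        }
        Matcher m = CHARSET.matcher(contentType);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // The two meta tag values from the pages described above.
        System.out.println(parseCharset("text/html; charset=Big5"));
        System.out.println(parseCharset("text/html; charset=Big5-HKSCS"));
    }
}
```

The point is that the first page yields the name "Big5" as-is, so any correction has to happen in the alias step that follows.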

-- Ken


-----Original Message-----
I want to crawl a document with charset="big5-hkscs" (which is an
extension of big5, with extra Hong Kong Chinese characters).  But the
document's meta tag sets content="text/html; charset=big5" instead.  So the
crawl engine treats the document as "big5" instead of "big5-hkscs".  That
makes the extra Hong Kong characters unreadable on the search result page.  So
my question is:  Can I force the crawl engine to treat the document as
"big5-hkscs"?

I don't know of a way to do this without some coding.

You could modify the resolveEncodingAlias method to add (or
uncomment) the aliasing of big5 => big5-hkscs, but then you'd have to
rebuild Nutch.
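
As a rough idea of what such an alias looks like, here is a standalone sketch. It is not the code in StringUtil: the class name EncodingAliasSketch and the map-based lookup are assumptions, and only the big5 => big5-hkscs mapping comes from the discussion above.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration only; the real method lives in Nutch's StringUtil.
public class EncodingAliasSketch {

    // Lower-cased declared name -> name to use when decoding.
    private static final Map<String, String> ALIASES = new HashMap<>();
    static {
        // Treat plain Big5 as its Hong Kong superset.
        ALIASES.put("big5", "Big5-HKSCS");
    }

    // Returns the aliased name if one is registered, otherwise the trimmed input.
    public static String resolveEncodingAlias(String encoding) {
        if (encoding == null) {
            return null;
        }
        String trimmed = encoding.trim();
        String key = trimmed.toLowerCase(Locale.ROOT);
        return ALIASES.getOrDefault(key, trimmed);
    }

    public static void main(String[] args) {
        System.out.println(resolveEncodingAlias("Big5"));
        System.out.println(resolveEncodingAlias("utf-8"));
    }
}
```

Note that this changes the handling of every page that declares big5, not just yours, and it only helps if the JVM running the crawl supports the target charset; Charset.isSupported("Big5-HKSCS") is a quick way to check.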

See the resolveEncodingAlias() method here:

http://www.krugle.com/files/svn/svn.apache.org/lucene/nutch/trunk/src/java/org/apache/nutch/util/StringUtil.java


--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"Find Code, Find Answers"
