Thanks for your reply.

I found that the method you mentioned inspects the HTTP header sent by the
web server: it looks for "charset" and performs the mapping.  The Apache web
server that hosts the documents is already configured with:

AddDefaultCharset Big5-HKSCS

The crawl engine does treat the encoding of all pages from that web server as
Big5-HKSCS.
However, the crawl engine also looks at the meta tag of each HTML page.
I have two otherwise identical HTML pages containing Hong Kong Big5
characters. One has the tag

<meta http-equiv="Content-Type" content="text/html; charset=Big5" />

The other has

<meta http-equiv="Content-Type" content="text/html; charset=Big5-HKSCS" />

When both of these HTML pages appear on the search result page, the "summary"
of the first one contains unreadable characters.
So I think I need to modify the code that reads the meta tag of the HTML page.
Do you have any suggestions?

Thanks,
Kenneth Man

-----Original Message-----
>I want to do crawling on document with charset="big5-hkscs" (which is an
>extension of big5, with extra hong kong chinese characters).  But the
>document's meta tags set content="text/html; charset=big5" instead.  So the
>crawl engine treats the document as "big5" instead of "big5-hkscs".  That
>makes the extra hong kong characters unreadable on search result page.  So
>my question is:  Can I force the crawl engine to treat the document as
>"big5-hkscs"?

I don't know of a way to do this without some coding.

You could modify the resolveEncodingAlias method to add (or 
uncomment) the aliasing of big5 => big5-hkscs, but then you'd have to 
rebuild Nutch.

See the resolveEncodingAlias() method here:

http://www.krugle.com/files/svn/svn.apache.org/lucene/nutch/trunk/src/java/org/apache/nutch/util/StringUtil.java
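
The aliasing idea above could be sketched roughly as follows. This is only an
illustration, not the actual Nutch code: the class name EncodingAliasSketch,
the ALIASES table, and the helper methods resolveAlias and sniffMetaCharset
are all hypothetical, and the real resolveEncodingAlias() in StringUtil.java
may be structured differently. It assumes the charset is declared in a meta
tag like the ones quoted in this thread.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EncodingAliasSketch {
    // Hypothetical alias table: declared charset (lower case) -> charset to decode with.
    private static final Map<String, String> ALIASES = new HashMap<>();
    static {
        // Treat plain "big5" as its superset so the extra HKSCS characters decode.
        ALIASES.put("big5", "Big5-HKSCS");
    }

    // Matches charset=... inside a meta tag's content attribute (assumed page layout).
    private static final Pattern META_CHARSET =
        Pattern.compile("charset\\s*=\\s*[\"']?([A-Za-z0-9_\\-]+)", Pattern.CASE_INSENSITIVE);

    /** Returns the aliased charset name, or the trimmed input if no alias is registered. */
    public static String resolveAlias(String encoding) {
        if (encoding == null) {
            return null;
        }
        String trimmed = encoding.trim();
        String mapped = ALIASES.get(trimmed.toLowerCase(Locale.ROOT));
        return mapped != null ? mapped : trimmed;
    }

    /** Returns the first charset name declared in the HTML text, or null if none is found. */
    public static String sniffMetaCharset(String html) {
        Matcher m = META_CHARSET.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String page =
            "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=Big5\" />";
        String declared = sniffMetaCharset(page);
        // Prints: Big5 -> Big5-HKSCS
        System.out.println(declared + " -> " + resolveAlias(declared));
    }
}
```

Because the alias is applied to the declared name regardless of where it was
found, a page whose meta tag says "Big5" and a page that says "Big5-HKSCS"
would both end up decoded as Big5-HKSCS in this sketch.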
