>Thanks for your reply.
>
>I have found that the method you mentioned looks at the HTTP header from
>the web server.  It looks for "charset" and does the mapping.  The Apache web
>server that hosts the document is already configured with:
>
>AddDefaultCharset Big5-HKSCS
>
>The crawl engine does treat the encoding of all pages from the web server as
>Big5-HKSCS.
>But the crawl engine also looks into the meta tag of the html page.
>I have two identical html pages with hong kong big5 characters. One has the
>tag
>
><meta http-equiv="Content-Type" content="text/html; charset=Big5" />
>
>The other
>
><meta http-equiv="Content-Type" content="text/html; charset=Big5-HKSCS" />
>
>When both of these html pages appear in the search results, the "summary"
>of the first one contains unreadable characters.
>So I think I need to modify the code that reads the meta tag of the html page.
>Do you have any idea?

From a quick look at the source, the meta tag handling eventually also
calls StringUtil.resolveEncodingAlias().

HtmlParser.getParse() calls StringUtil.parseCharacterEncoding(), 
passing it the content-type meta data, and then takes the returned 
charset name and calls StringUtil.resolveEncodingAlias().
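
To illustrate the first step, here is a rough sketch of pulling the 
charset name out of a content-type value like the ones in your meta 
tags. This is not the Nutch code; the class name, method name and regex 
are all made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: extracts the charset parameter
// from a Content-Type value such as "text/html; charset=Big5".
final class ContentTypeCharsetSketch {

    // Matches charset=<name>, allowing optional whitespace and quotes.
    private static final Pattern CHARSET = Pattern.compile(
            "charset\\s*=\\s*[\"']?([^\\s;\"']+)", Pattern.CASE_INSENSITIVE);

    // Returns the charset name exactly as written, or null if none is present.
    static String parseCharset(String contentType) {
        if (contentType == null) {
            return null;
        }
        Matcher m = CHARSET.matcher(contentType);
        return m.find() ? m.group(1) : null;
    }
}
```

With your two pages this would return "Big5" for the first and 
"Big5-HKSCS" for the second, which is why only the first one is decoded 
with the wrong charset.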

So if you fix StringUtil.resolveEncodingAlias(), I think it will take 
care of both issues (HTTP server and HTML pages).
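
The kind of aliasing I have in mind looks roughly like the sketch 
below. It is a standalone illustration, not the actual 
StringUtil.resolveEncodingAlias() implementation; the class name and the 
map are assumptions, and the only alias shown is the big5 => big5-hkscs 
one discussed in this thread.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-in for an encoding alias table, not the Nutch class.
final class EncodingAliasSketch {

    // Keys are lower-cased declared names; values are the charset to use instead.
    private static final Map<String, String> ALIASES = new HashMap<>();
    static {
        // Big5-HKSCS extends Big5, so decoding Big5 pages with it keeps
        // the extra Hong Kong characters readable.
        ALIASES.put("big5", "Big5-HKSCS");
    }

    // Returns the aliased charset name, or the trimmed input if no alias exists.
    static String resolve(String encoding) {
        if (encoding == null) {
            return null;
        }
        String trimmed = encoding.trim();
        String alias = ALIASES.get(trimmed.toLowerCase(Locale.ROOT));
        return alias != null ? alias : trimmed;
    }
}
```

Because the lookup is done on the charset name itself, it does not 
matter whether that name came from the HTTP header or from the meta tag.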

-- Ken


>-----Original Message-----
>>I want to do crawling on document with charset="big5-hkscs" (which is an
>>extension of big5, with extra hong kong chinese characters).  But the
>>document's meta tags set content="text/html; charset=big5" instead.  So the
>>crawl engine treats the document as "big5" instead of "big5-hkscs".  That
>>makes the extra Hong Kong characters unreadable on the search result page.
>>So my question is:  Can I force the crawl engine to treat the document as
>>"big5-hkscs"?
>
>I don't know of a way to do this without some coding.
>
>You could modify the resolveEncodingAlias method to add (or
>uncomment) the aliasing of big5 => big5-hkscs, but then you'd have to
>rebuild Nutch.
>
>See the resolveEncodingAlias() method here:
>
>http://www.krugle.com/files/svn/svn.apache.org/lucene/nutch/trunk/src/java/org/apache/nutch/util/StringUtil.java


-- 
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"Find Code, Find Answers"


_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
