>Thanks for your reply.
>
>I have found that the method you mentioned looks at the HTTP header from
>the web server. It looks for "charset" and does the mapping. The Apache web
>server which hosts the documents is already configured with:
>
>AddDefaultCharset Big5-HKSCS
>
>The crawl engine does treat the encoding of all pages from the web server as
>Big5-HKSCS.
>But the crawl engine also looks at the meta tag of the HTML page.
>I have two identical HTML pages with Hong Kong Big5 characters. One has the
>tag
>
><meta http-equiv="Content-Type" content="text/html; charset=Big5" />
>
>The other
>
><meta http-equiv="Content-Type" content="text/html; charset=Big5-HKSCS" />
>
>When both of these HTML pages are in the search result page, the "summary"
>of the first one contains unreadable characters.
>So I think I need to modify the code which reads the meta tag of the HTML
>page.
>Do you have any idea?

From a quick look at the source, the meta tag handling eventually also calls
StringUtil.resolveEncodingAlias(). HtmlParser.getParse() calls
StringUtil.parseCharacterEncoding(), passing it the content-type meta data,
and then takes the returned charset name and calls
StringUtil.resolveEncodingAlias().

So if you fix StringUtil.resolveEncodingAlias(), I think it will take care of
both issues (HTTP server and HTML pages).

-- Ken

>-----Original Message-----
>>I want to crawl documents with charset="big5-hkscs" (which is an
>>extension of big5, with extra Hong Kong Chinese characters). But the
>>documents' meta tags set content="text/html; charset=big5" instead. So the
>>crawl engine treats the documents as "big5" instead of "big5-hkscs". That
>>makes the extra Hong Kong characters unreadable on the search result page.
>>Now my question is: Can I force the crawl engine to treat the documents as
>>"big5-hkscs"?
>
>I don't know of a way to do this without some coding.
>
>You could modify the resolveEncodingAlias method to add (or
>uncomment) the aliasing of big5 => big5-hkscs, but then you'd have to
>rebuild Nutch.
>
>See the resolveEncodingAlias() method here:
>
>http://www.krugle.com/files/svn/svn.apache.org/lucene/nutch/trunk/src/java/org/apache/nutch/util/StringUtil.java

--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"Find Code, Find Answers"
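
P.S. To make the idea in the reply above concrete, here is a minimal sketch of the two steps it describes: pulling the charset name out of a content-type string, then mapping it through an alias table. This is not the Nutch source. The class name EncodingAliasSketch and the alias table are hypothetical, and the two method names only mirror the ones mentioned above; the real StringUtil methods may differ in signature and behaviour.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-in for illustration only; NOT the Nutch StringUtil class.
final class EncodingAliasSketch {

    // Assumed alias table: declared charset (lower case) -> charset to decode with.
    private static final Map<String, String> ALIASES = new HashMap<>();
    static {
        // The aliasing discussed in this thread: treat big5 as big5-hkscs.
        ALIASES.put("big5", "Big5-HKSCS");
    }

    // Extracts the charset value from a string such as
    // "text/html; charset=Big5". Returns null if no charset is declared.
    static String parseCharacterEncoding(String contentType) {
        if (contentType == null) {
            return null;
        }
        String key = "charset=";
        int start = contentType.toLowerCase(Locale.ROOT).indexOf(key);
        if (start < 0) {
            return null;
        }
        String value = contentType.substring(start + key.length());
        int end = value.indexOf(';');
        if (end >= 0) {
            value = value.substring(0, end);
        }
        value = value.trim();
        // Drop optional surrounding double quotes.
        if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
            value = value.substring(1, value.length() - 1);
        }
        return value.isEmpty() ? null : value;
    }

    // Returns the aliased charset name if one is registered,
    // otherwise the name that was passed in.
    static String resolveEncodingAlias(String encoding) {
        if (encoding == null) {
            return null;
        }
        String alias = ALIASES.get(encoding.toLowerCase(Locale.ROOT));
        return alias != null ? alias : encoding;
    }
}
```

With this sketch, both of the meta tags quoted above would resolve to the same name: resolveEncodingAlias(parseCharacterEncoding("text/html; charset=Big5")) and the same call with charset=Big5-HKSCS each yield "Big5-HKSCS". Because the alias applies to every page declared as big5, this only makes sense if decoding plain Big5 pages as Big5-HKSCS is acceptable for your crawl, which seems reasonable given that the thread describes Big5-HKSCS as an extension of Big5.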
