> I want to crawl a document encoded as charset="big5-hkscs" (an extension
> of big5 with extra Hong Kong Chinese characters). But the document's meta
> tags set content="text/html; charset=big5" instead, so the crawl engine
> treats the document as "big5" rather than "big5-hkscs". That makes the
> extra Hong Kong characters unreadable on the search results page. My
> question is: can I force the crawl engine to treat the document as
> "big5-hkscs"?
I don't know of a way to do this without some coding. You could modify the resolveEncodingAlias() method to add (or uncomment) the aliasing of big5 => big5-hkscs, but then you'd have to rebuild Nutch.

See the resolveEncodingAlias() method here:

http://www.krugle.com/files/svn/svn.apache.org/lucene/nutch/trunk/src/java/org/apache/nutch/util/StringUtil.java

-- Ken

--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"Find Code, Find Answers"
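
For illustration only, here is a minimal sketch of what a big5 => big5-hkscs alias could look like. It is not the actual Nutch source: the class name EncodingAliasSketch, the ALIASES table, and the resolveAlias() helper are all hypothetical, and the real resolveEncodingAlias() in StringUtil.java may be structured differently. It assumes that decoding pages declared as plain big5 with the HKSCS superset is acceptable for your crawl.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-in for an encoding alias lookup; NOT the Nutch implementation.
final class EncodingAliasSketch {

    // Maps a declared charset name (lowercased) to the charset to decode with.
    private static final Map<String, String> ALIASES = new HashMap<>();

    static {
        // Assumption: treat every page declared as big5 as Big5-HKSCS,
        // so the extra Hong Kong characters survive decoding.
        ALIASES.put("big5", "Big5-HKSCS");
    }

    // Returns the aliased charset name, or the trimmed input if no alias exists.
    static String resolveAlias(String declared) {
        if (declared == null) {
            return null;
        }
        String trimmed = declared.trim();
        String key = trimmed.toLowerCase(Locale.ROOT);
        return ALIASES.getOrDefault(key, trimmed);
    }

    private EncodingAliasSketch() {
        // Utility class; no instances.
    }
}
```

With a mapping like this, a page whose meta tag declares charset=big5 would be decoded as Big5-HKSCS, while other declared charsets pass through unchanged. Whether your JVM ships the Big5-HKSCS charset can be checked with java.nio.charset.Charset.isSupported("Big5-HKSCS").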
