I want to crawl a document whose real charset is "big5-hkscs" (an
extension of big5 with extra Hong Kong Chinese characters). But the
document's meta tag sets content="text/html; charset=big5" instead, so the
crawl engine treats the document as "big5" rather than "big5-hkscs". That
makes the extra Hong Kong characters unreadable on the search result page.
So my question is: can I force the crawl engine to treat the document as
"big5-hkscs"?
I don't know of a way to do this without some coding.
You could modify the resolveEncodingAlias method to add (or
uncomment) the aliasing of big5 => big5-hkscs, but then you'd have to
rebuild Nutch.
See the resolveEncodingAlias() method here:
http://www.krugle.com/files/svn/svn.apache.org/lucene/nutch/trunk/src/java/org/apache/nutch/util/StringUtil.java
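
As a rough illustration of the aliasing idea, here is a minimal standalone sketch. It is not the actual Nutch code: the class name EncodingAliasSketch and the ALIASES map are made up for this example, and it assumes your JVM ships the Big5-HKSCS charset (if it does not, the sketch falls back to plain Big5).

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical standalone class, not part of Nutch.
public class EncodingAliasSketch {

    // Maps a lower-cased canonical charset name to the charset that
    // should be used instead. Only the big5 entry is shown here.
    private static final Map<String, String> ALIASES = new HashMap<>();
    static {
        ALIASES.put("big5", "Big5-HKSCS");
    }

    // Returns the charset name to decode with, or null if the declared
    // encoding is missing, malformed, or unknown to this JVM.
    public static String resolveEncodingAlias(String encoding) {
        if (encoding == null) {
            return null;
        }
        try {
            if (!Charset.isSupported(encoding)) {
                return null;
            }
        } catch (IllegalCharsetNameException e) {
            return null;
        }
        // Normalize whatever the page declared ("BIG5", "big5", ...) to
        // the JVM's canonical name before looking up an alias.
        String canonical = Charset.forName(encoding).name();
        String alias = ALIASES.get(canonical.toLowerCase(Locale.ROOT));
        // Only apply the alias if this JVM can actually decode it.
        if (alias != null && Charset.isSupported(alias)) {
            return alias;
        }
        return canonical;
    }

    public static void main(String[] args) {
        // Charset declared by the page's meta tag.
        System.out.println(resolveEncodingAlias("big5"));
    }
}
```

Since Big5-HKSCS is an extension of big5, pages that really are plain big5 should generally still decode acceptably under the alias, but that is worth checking against your own crawl data before rebuilding.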
-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"Find Code, Find Answers"