I want to crawl a document encoded as charset="big5-hkscs" (an extension
of big5 with extra Hong Kong Chinese characters).  But the document's
meta tags set content="text/html; charset=big5" instead, so the crawl
engine treats the document as "big5" rather than "big5-hkscs".  That
makes the extra Hong Kong characters unreadable on the search result page.
So my question is:  Can I force the crawl engine to treat the document as
"big5-hkscs"?

I don't know of a way to do this without some coding.

You could modify the resolveEncodingAlias method to add (or uncomment) the aliasing of big5 => big5-hkscs, but then you'd have to rebuild Nutch.

See the resolveEncodingAlias() method here:

http://www.krugle.com/files/svn/svn.apache.org/lucene/nutch/trunk/src/java/org/apache/nutch/util/StringUtil.java
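
To illustrate the aliasing idea, here is a minimal, self-contained sketch. It is not the Nutch source: the class name EncodingAliasSketch, the resolve() method, and the contents of the alias table are all hypothetical, and the real resolveEncodingAlias() may normalize or validate charset names differently. It only shows the general technique of mapping a declared charset to a superset charset before decoding.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration of charset aliasing; not Nutch's actual code.
public class EncodingAliasSketch {

    // Keys are lower-cased declared charset names; values are the charset
    // to decode with instead. The single entry here is an assumption that
    // mirrors the big5 => big5-hkscs alias discussed above.
    private static final Map<String, String> ALIASES = new HashMap<>();
    static {
        ALIASES.put("big5", "Big5-HKSCS");
    }

    // Returns the aliased charset name if one is registered, otherwise
    // the declared name unchanged. A null input yields null.
    public static String resolve(String declared) {
        if (declared == null) {
            return null;
        }
        String key = declared.trim().toLowerCase(Locale.ROOT);
        String alias = ALIASES.get(key);
        return alias != null ? alias : declared;
    }

    public static void main(String[] args) {
        System.out.println(resolve("big5"));   // prints "Big5-HKSCS"
        System.out.println(resolve("UTF-8"));  // prints "UTF-8"
    }
}
```

Since big5-hkscs is an extension of big5, as noted in the question, decoding with the extended charset should keep the plain big5 text readable. The catch is that an alias like this applies to every page declaring big5, not just the Hong Kong ones.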

-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"Find Code, Find Answers"
