>I want to crawl a document encoded in charset="big5-hkscs" (which is an
>extension of big5, with extra Hong Kong Chinese characters).  But the
>document's meta tags set content="text/html; charset=big5" instead.  So the
>crawl engine treats the document as "big5" instead of "big5-hkscs".  That
>makes the extra Hong Kong characters unreadable on the search results page.
>My question is:  Can I force the crawl engine to treat the document as
>"big5-hkscs"?

I don't know of a way to do this without some coding.

You could modify the resolveEncodingAlias method to add (or 
uncomment) the aliasing of big5 => big5-hkscs, but then you'd have to 
rebuild Nutch.

See the resolveEncodingAlias() method here:

http://www.krugle.com/files/svn/svn.apache.org/lucene/nutch/trunk/src/java/org/apache/nutch/util/StringUtil.java
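
To illustrate the aliasing idea, here is a minimal standalone sketch in Java. It is not the actual Nutch code: the class name EncodingAliasSketch, the ALIASES map, and the resolveAlias method are hypothetical names chosen for this example, and the real resolveEncodingAlias() may differ in structure and behavior. The only assumption about the JDK is that "Big5-HKSCS" is the Java charset name for the Hong Kong extension.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch, not Nutch's StringUtil: maps a declared charset
// name to the charset that should actually be used for decoding.
class EncodingAliasSketch {
    // Keys are lowercased declared names; values are the names to use instead.
    private static final Map<String, String> ALIASES = new HashMap<>();
    static {
        // Treat pages declared as big5 as Big5-HKSCS, so the extra
        // Hong Kong characters are decoded as well.
        ALIASES.put("big5", "Big5-HKSCS");
    }

    // Returns the aliased charset name, or the trimmed input if no alias exists.
    static String resolveAlias(String declared) {
        if (declared == null) {
            return null;
        }
        String trimmed = declared.trim();
        String alias = ALIASES.get(trimmed.toLowerCase(Locale.ROOT));
        return alias != null ? alias : trimmed;
    }

    public static void main(String[] args) {
        System.out.println(resolveAlias("big5"));
        System.out.println(resolveAlias("UTF-8"));
    }
}
```

Since big5-hkscs extends big5, decoding a page declared as big5 with the aliased charset should leave ordinary big5 text readable. In Nutch itself the equivalent change still means editing the source and rebuilding, as noted above.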

-- Ken
-- 
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"Find Code, Find Answers"
