>I ran into a problem like Kenneth's.
>
>I crawl a page whose actual charset is GB18030, but the meta tag of the
>page sets it to gb2312.
>
>So I get some unreadable characters when parsing it.
>
>I have changed StringUtil.resolveEncodingAlias() following Ken's advice:
>
>  encodingAliases.put("GB2312", "GB18030");
>
>and I now get the message "setting encoding to GB18030".
>
>But it seems to be equally useless: the result shows unreadable characters
>again.
>It seems that the parser still uses the original encoding, gb2312.
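
To illustrate the symptom described above, here is a minimal sketch (not Nutch code; the class and method names are made up for the example) of what happens when bytes that are really GB18030 are decoded as GB2312. It assumes the JRE ships both charsets, and it uses U+9555, a character that GBK/GB18030 can encode but that is not part of GB2312.

```java
import java.nio.charset.Charset;

// Hypothetical demo class, not part of Nutch.
class CharsetMismatchDemo {
    // Decode raw bytes with the named charset. With this String constructor,
    // malformed or unmappable input is replaced rather than reported.
    static String decode(byte[] raw, String charsetName) {
        return new String(raw, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // U+9555 exists in GBK/GB18030 but not in GB2312.
        String original = "\u9555";
        byte[] raw = original.getBytes(Charset.forName("GB18030"));

        // Decoding with the charset the bytes were written in round-trips.
        System.out.println("GB18030 ok: " + original.equals(decode(raw, "GB18030")));
        // Decoding with the charset named in the meta tag does not.
        System.out.println("GB2312 ok:  " + original.equals(decode(raw, "GB2312")));
    }
}
```

If the second check fails on your system the same way, the parser is most likely still decoding with gb2312 somewhere, regardless of the log message.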

Another change I'd suggest making is to verify that 
Charset.isSupported() returns true for a found alias, before 
returning that name from resolveEncodingAlias().

From what I can tell, it seems like the current implementation is 
backwards - first it should look up the alias, and then make a call 
to check whether that alias is supported.

But a quick check on my system (JRE 1.5, Mac OS X 10.4.7) says that 
GB18030 is supported, so I'm guessing that's not your problem.
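
A rough sketch of the order I have in mind (look up the alias first, then check that the result is supported) might look like the following. This is not the actual Nutch source: the class name, the alias table contents, and the choice to return null for unsupported names are all assumptions for illustration.

```java
import java.nio.charset.Charset;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-in for the alias handling in StringUtil; not Nutch code.
class EncodingAliasSketch {
    // Assumed alias table, keyed by upper-cased charset name.
    private static final Map<String, String> ALIASES = new HashMap<>();
    static {
        ALIASES.put("GB2312", "GB18030");
        ALIASES.put("BIG5", "Big5-HKSCS");
    }

    // Returns the charset name to use, or null if it is not supported.
    static String resolveEncodingAlias(String encoding) {
        if (encoding == null) {
            return null;
        }
        String name = encoding.trim();
        // Step 1: map the declared name to its alias, if one exists.
        String alias = ALIASES.get(name.toUpperCase(Locale.ROOT));
        String candidate = (alias != null) ? alias : name;
        // Step 2: only return a name that this JVM can actually decode.
        try {
            return Charset.isSupported(candidate) ? candidate : null;
        } catch (IllegalArgumentException e) {
            // Charset.isSupported() rejects syntactically illegal names.
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(resolveEncodingAlias("gb2312"));
        System.out.println(resolveEncodingAlias("no-such-charset"));
    }
}
```

Returning null leaves it to the caller to fall back to a default encoding; whether that is the right behavior for Nutch is a separate question.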

-- Ken


>Ken Krugler wrote:
>>
>>>Thanks for your reply.
>>>
>>>I have found that the method you mentioned looks at the HTTP header from
>>>the web server.  It looks for "charset" and does the mapping.  The Apache web
>>>server that serves the document is already configured with:
>>>
>>>AddDefaultCharset Big5-HKSCS
>>>
>>>The crawl engine does treat the encoding of all pages from the web server
>>>as Big5-HKSCS.
>>>But the crawl engine also looks at the meta tag of the HTML page.
>>>I have two identical HTML pages with Hong Kong Big5 characters. One has
>>>the tag
>>>
>>><meta http-equiv="Content-Type" content="text/html; charset=Big5" />
>>>
>>>The other
>>>
>>><meta http-equiv="Content-Type" content="text/html; charset=Big5-HKSCS" />
>>>
>>>When both of these HTML pages appear on the search result page, the "summary"
>>>of the first one contains unreadable characters.
>>>So I think I need to modify the code that reads the meta tag of the HTML
>>>page.
>>>Do you have any idea?
>>
>>  From a quick look at the source, this eventually also calls
>>  StringUtil.resolveEncodingAlias().
>>
>>  HtmlParser.getParse() calls StringUtil.parseCharacterEncoding(),
>>  passing it the content-type meta data, and then takes the returned
>>  charset name and calls StringUtil.resolveEncodingAlias().
>>
>>  So if you fix StringUtil.resolveEncodingAlias(), I think it will take
>>  care of both issues (HTTP server and HTML pages).
>>
>>  -- Ken
>>
>>
>>>-----Original Message-----
>>>>I want to crawl a document with charset="big5-hkscs" (which is an
>>>>extension of big5, with extra Hong Kong Chinese characters).  But the
>>>>document's meta tags set content="text/html; charset=big5" instead.  So
>>>>the crawl engine treats the document as "big5" instead of "big5-hkscs".
>>>>That makes the extra Hong Kong characters unreadable on the search result
>>>>page.  Now my question is:  Can I force the crawl engine to treat the
>>>>document as "big5-hkscs"?
>>>
>>>I don't know of a way to do this without some coding.
>>>
>>>You could modify the resolveEncodingAlias method to add (or
>>>uncomment) the aliasing of big5 => big5-hkscs, but then you'd have to
>>>rebuild Nutch.
>>>
>>>See the resolveEncodingAlias() method here:
>>>
>>>http://www.krugle.com/files/svn/svn.apache.org/lucene/nutch/trunk/src/java/org/apache/nutch/util/StringUtil.java
>>
>>
>>  --
>>  Ken Krugler
>>  Krugle, Inc.
>>  +1 530-210-6378
>>  "Find Code, Find Answers"
>>
>>
>>
>
>--
>View this message in context: 
>http://www.nabble.com/Charset-question-tf2231717.html#a6353390
>Sent from the Nutch - User forum at Nabble.com.


-- 
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"Find Code, Find Answers"

_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
