Re: [Nutch-general] Charset question

King Kong Sun, 17 Sep 2006 12:25:25 -0700

I have run into the same problem as Kenneth.

I am crawling a page whose actual charset is GB18030, but the page's meta tag
declares it as gb2312.


As a result, I get unreadable characters when the page is parsed.

Following Ken's advice, I modified StringUtil.resolveEncodingAlias() by adding:

 encodingAliases.put("GB2312", "GB18030");

and I now get the log message "setting encoding to GB18030".
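
A minimal sketch of the kind of lookup involved is shown below. It is a
stand-in, not the actual Nutch source: the class name EncodingAliasSketch, the
regular expression, and both method bodies are assumptions made for
illustration; only the GB2312 => GB18030 alias comes from the change above.
The sketch upper-cases the declared name before the lookup so that a lowercase
"gb2312" from a meta tag still matches the "GB2312" key. Whether Nutch
normalizes the name the same way is not confirmed here and is worth checking.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for the alias handling discussed in this thread;
// it is NOT the actual org.apache.nutch.util.StringUtil source.
final class EncodingAliasSketch {
    // Alias table: declared charset name (upper case) -> charset to decode with.
    private static final Map<String, String> ENCODING_ALIASES = new HashMap<>();
    static {
        ENCODING_ALIASES.put("GB2312", "GB18030"); // the alias added above
    }

    // Matches the charset parameter in a value such as "text/html; charset=gb2312".
    private static final Pattern CHARSET_PARAM =
        Pattern.compile("charset\\s*=\\s*[\"']?([^\\s;\"']+)", Pattern.CASE_INSENSITIVE);

    // Returns the declared charset name, or null if none is present.
    public static String parseCharacterEncoding(String contentType) {
        if (contentType == null) return null;
        Matcher m = CHARSET_PARAM.matcher(contentType);
        return m.find() ? m.group(1) : null;
    }

    // Returns the alias target if one is registered, otherwise the normalized name.
    public static String resolveEncodingAlias(String encoding) {
        if (encoding == null) return null;
        String key = encoding.trim().toUpperCase(Locale.ROOT);
        return ENCODING_ALIASES.getOrDefault(key, key);
    }

    public static void main(String[] args) {
        String declared = parseCharacterEncoding("text/html; charset=gb2312");
        System.out.println(declared + " -> " + resolveEncodingAlias(declared));
    }
}
```

Run as written, the main method prints "gb2312 -> GB18030".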

However, the change seems to have no effect: the results still contain
unreadable characters.
It seems that the parser still uses the original encoding, gb2312.
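
One way to narrow this down is to check which charset the page bytes are
really decoded with, since resolving the name only helps if the decoder later
uses the resolved name. The sketch below is not Nutch code (the class and
method names are hypothetical). It only shows that bytes encoded as GB18030
come back damaged when decoded as GB2312 and intact when decoded as GB18030.
It assumes the JVM ships both charsets, which full JDK builds normally do.

```java
import java.nio.charset.Charset;

// Hypothetical demo class, not part of Nutch.
final class CharsetMismatchDemo {
    // Decodes raw page bytes with the named charset.
    public static String decode(byte[] raw, String charsetName) {
        return new String(raw, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // U+9F8D is a traditional character that GB18030 covers but GB2312 does not.
        String original = "\u9f8d";
        byte[] raw = original.getBytes(Charset.forName("GB18030"));

        String asDeclared = decode(raw, "GB2312");   // what the meta tag claims
        String asResolved = decode(raw, "GB18030");  // what the alias resolves to

        System.out.println("decoded as GB2312 matches original:  " + original.equals(asDeclared));
        System.out.println("decoded as GB18030 matches original: " + original.equals(asResolved));
    }
}
```

The first line should report false and the second true. If the crawl output
looks like the GB2312 case, the bytes are probably still being decoded with the
declared name somewhere after the alias is resolved.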

Could you give me a hand?

Thanks in advance.

King Kong


Ken Krugler wrote:
> 
>>Thanks for your reply.
>>
>>I have found that the method you mentioned looks into the http header from
>>web server.  It looks for "charset" and does the mapping.  The apache web
>>server which contains the document has already  configured:
>>
>>AddDefaultCharset Big5-HKSCS
>>
>>The crawl engine does treat the encoding of all pages from the web server as
>>Big5-HKSCS.
>>But the crawl engine also looks into the meta tag of the html page.
>>I have two identical html pages with hong kong big5 characters. One has the
>>tag
>>
>><meta http-equiv="Content-Type" content="text/html; charset=Big5" />
>>
>>The other
>>
>><meta http-equiv="Content-Type" content="text/html; charset=Big5-HKSCS" />
>>
>>When both of these html pages are in the search result page, the "summary"
>>of the first one contains unreadable characters.
>>So I think I need to modify the code that reads the meta tag of the html
>>page.
>>Do you have any idea?
> 
>  From a quick look at the source, this eventually also calls 
> StringUtil.resolveEncodingAlias().
> 
> HtmlParser.getParse() calls StringUtil.parseCharacterEncoding(), 
> passing it the content-type meta data, and then takes the returned 
> charset name and calls StringUtil.resolveEncodingAlias().
> 
> So if you fix StringUtil.resolveEncodingAlias(), I think it will take 
> care of both issues (HTTP server and HTML pages).
> 
> -- Ken
> 
> 
>>-----Original Message-----
>>>I want to do crawling on document with charset="big5-hkscs" (which is an
>>>extension of big5, with extra hong kong chinese characters).  But the
>>>document's meta tags set content="text/html; charset=big5" instead.  So the
>>>crawl engine treats the document as "big5" instead of "big5-hkscs".  That
>>>makes the extra hong kong characters unreadable on search result page.  Now
>>>my question is:  Can I force the crawl engine to treat the document as
>>>"big5-hkscs"?
>>
>>I don't know of a way to do this without some coding.
>>
>>You could modify the resolveEncodingAlias method to add (or
>>uncomment) the aliasing of big5 => big5-hkscs, but then you'd have to
>>rebuild Nutch.
>>
>>See the resolveEncodingAlias() method here:
>>
>>http://www.krugle.com/files/svn/svn.apache.org/lucene/nutch/trunk/src/java/org/apache/nutch/util/StringUtil.java
> 
> 
> -- 
> Ken Krugler
> Krugle, Inc.
> +1 530-210-6378
> "Find Code, Find Answers"
> 
> 
> 


