I ran into a problem similar to Kenneth's.
I am crawling a page whose actual charset is GB18030, but the page's meta
tag declares it as gb2312,
so I get unreadable characters when the page is parsed.
Following Ken's advice, I changed StringUtil.resolveEncodingAlias() by adding
encodingAliases.put("GB2312", "GB18030");
and I now get the message "setting encoding to GB18030",
but it does not seem to help: the result still contains unreadable
characters.
It seems that the parser still uses the original encoding, gb2312.
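One way to confirm which charset the page bytes really need is to decode the same bytes both ways and compare. This is only a minimal sketch, assuming the JVM ships both GB18030 and GB2312 decoders; the class and method names are hypothetical and are not part of Nutch.

```java
import java.nio.charset.Charset;

// Hypothetical diagnostic helper, not Nutch code.
class DeclaredVsActualCharsetSketch {
    // Decode bytes with the named charset; malformed input becomes U+FFFD.
    static String decode(byte[] bytes, String charsetName) {
        return new String(bytes, Charset.forName(charsetName));
    }

    // True if the charset is available and decoding reproduces the expected text.
    static boolean decodesCleanly(byte[] bytes, String charsetName, String expected) {
        return Charset.isSupported(charsetName)
                && decode(bytes, charsetName).equals(expected);
    }

    public static void main(String[] args) {
        if (!Charset.isSupported("GB18030") || !Charset.isSupported("GB2312")) {
            System.out.println("This JVM lacks GB18030 or GB2312; nothing to compare.");
            return;
        }
        // U+9555 is used here as a character that GB18030 has and GB2312 lacks.
        String text = "\u9555";
        byte[] bytes = text.getBytes(Charset.forName("GB18030"));
        System.out.println("decoded as GB18030 matches: "
                + decodesCleanly(bytes, "GB18030", text));
        System.out.println("decoded as GB2312 matches:  "
                + decodesCleanly(bytes, "GB2312", text));
    }
}
```

If the GB2312 decode fails to match while the GB18030 decode succeeds, the unreadable characters come from the parser still decoding with the declared charset.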
Another change I'd suggest making is to verify that
Charset.isSupported() returns true for a found alias, before
returning that name from resolveEncodingAlias().
From what I can tell, it seems like the current implementation is
backwards - first it should look up the alias, and then make a call
to check whether that alias is supported.
But a quick check on my system (JRE 1.5, Mac OS X 10.4.7) says that
GB18030 is supported, so I'm guessing that's not your problem.
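The ordering described above (look up the alias first, then check that the result is supported) could be sketched roughly as follows. This is not the actual Nutch StringUtil code: the class name, the alias table contents, and the fallback to the declared name are assumptions made for illustration. Only Charset.isSupported() is a real JDK call.

```java
import java.nio.charset.Charset;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not the actual Nutch StringUtil implementation.
class AliasThenSupportSketch {
    // Illustrative alias table; keys are upper-cased declared names.
    static final Map<String, String> ALIASES = new HashMap<>();
    static {
        ALIASES.put("GB2312", "GB18030");
        ALIASES.put("BIG5", "Big5-HKSCS");
    }

    // True only if the name is legal and this JVM can decode it.
    static boolean isUsable(String name) {
        try {
            return name != null && Charset.isSupported(name);
        } catch (IllegalArgumentException e) {
            // Covers IllegalCharsetNameException for malformed names.
            return false;
        }
    }

    // Step 1: look up the alias. Step 2: verify the alias is supported.
    // Falls back to the declared name; returns null if nothing is usable.
    static String resolve(String declared) {
        if (declared == null) {
            return null;
        }
        String trimmed = declared.trim();
        String alias = ALIASES.get(trimmed.toUpperCase(Locale.ROOT));
        if (alias != null && isUsable(alias)) {
            return alias;
        }
        return isUsable(trimmed) ? trimmed : null;
    }

    public static void main(String[] args) {
        System.out.println("GB18030 supported: " + isUsable("GB18030"));
        System.out.println("gb2312 resolves to: " + resolve("gb2312"));
    }
}
```

Running main() on a given JRE is also a quick way to repeat the "is GB18030 supported here" check.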
-- Ken
Ken Krugler wrote:
Thanks for your reply.
I found that the method you mentioned looks at the HTTP header from the
web server. It looks for "charset" and does the mapping. The Apache web
server that hosts the document is already configured with:
AddDefaultCharset Big5-HKSCS
The crawl engine does treat the encoding of all pages from that web server
as Big5-HKSCS.
But the crawl engine also looks at the meta tag of the HTML page.
I have two identical HTML pages containing Hong Kong Big5 characters. One
has the tag
<meta http-equiv="Content-Type" content="text/html; charset=Big5" />
The other
<meta http-equiv="Content-Type" content="text/html; charset=Big5-HKSCS" />
When both of these HTML pages appear on the search result page, the
"summary" of the first one contains unreadable characters.
So I think I need to modify the code that reads the meta tag of the HTML
page.
Do you have any ideas?
From a quick look at the source, this eventually also calls
StringUtil.resolveEncodingAlias().
HtmlParser.getParse() calls StringUtil.parseCharacterEncoding(),
passing it the content-type meta data, and then takes the returned
charset name and calls StringUtil.resolveEncodingAlias().
So if you fix StringUtil.resolveEncodingAlias(), I think it will take
care of both issues (HTTP server and HTML pages).
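As a rough illustration of that flow (extract the charset from a content-type value, then run it through the alias mapping), here is a minimal sketch. It is not the Nutch source: the class, the regex, and the method names parseCharset() and effectiveCharset() are hypothetical stand-ins for what parseCharacterEncoding() and resolveEncodingAlias() are described as doing.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of "parse charset, then resolve alias"; not Nutch code.
class ContentTypeCharsetSketch {
    // Matches charset=NAME, with optional quotes, in a content-type value.
    private static final Pattern CHARSET = Pattern.compile(
            "charset\\s*=\\s*[\"']?([^\\s;\"']+)", Pattern.CASE_INSENSITIVE);

    // Illustrative alias table; keys are upper-cased declared names.
    private static final Map<String, String> ALIASES = new HashMap<>();
    static {
        ALIASES.put("BIG5", "Big5-HKSCS");
    }

    // Returns the declared charset name, or null if none is present.
    static String parseCharset(String contentType) {
        if (contentType == null) {
            return null;
        }
        Matcher m = CHARSET.matcher(contentType);
        return m.find() ? m.group(1) : null;
    }

    // Applies the alias table to whatever charset was declared.
    static String effectiveCharset(String contentType) {
        String declared = parseCharset(contentType);
        if (declared == null) {
            return null;
        }
        String alias = ALIASES.get(declared.toUpperCase(Locale.ROOT));
        return alias != null ? alias : declared;
    }

    public static void main(String[] args) {
        // The same path serves an HTTP header value and a meta tag value.
        System.out.println(effectiveCharset("text/html; charset=Big5"));
        System.out.println(effectiveCharset("text/html; charset=Big5-HKSCS"));
    }
}
```

Because the HTTP header value and the meta tag content have the same shape, one alias table covers both sources in this sketch.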
-- Ken
-----Original Message-----
I want to crawl a document whose charset is "big5-hkscs" (an extension of
big5 with extra Hong Kong Chinese characters). But the document's meta tag
sets content="text/html; charset=big5" instead, so the crawl engine treats
the document as "big5" instead of "big5-hkscs". That makes the extra Hong
Kong characters unreadable on the search result page.
My question is: can I force the crawl engine to treat the document as
"big5-hkscs"?
I don't know of a way to do this without some coding.
You could modify the resolveEncodingAlias method to add (or
uncomment) the aliasing of big5 => big5-hkscs, but then you'd have to
rebuild Nutch.
See the resolveEncodingAlias() method here:
http://www.krugle.com/files/svn/svn.apache.org/lucene/nutch/trunk/src/java/org/apache/nutch/util/StringUtil.java
--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"Find Code, Find Answers"
--
View this message in context:
http://www.nabble.com/Charset-question-tf2231717.html#a6353390
Sent from the Nutch - User forum at Nabble.com.