RE: Charset question

Ken Krugler Sun, 17 Sep 2006 19:11:31 -0700

I ran into the same problem as Kenneth.

I crawled a page whose actual charset is GB18030, but the meta tag of the
page declares it as gb2312.


As a result, I get unreadable characters when the page is parsed.

I changed StringUtil.resolveEncodingAlias() following Ken's advice:

 encodingAliases.put("GB2312", "GB18030");

and I now get the message "setting encoding to GB18030".

But it seems to make no difference: the result still contains unreadable
characters.
It seems that the parser still uses the original encoding, gb2312.

Another change I'd suggest making is to verify that Charset.isSupported() returns true for a found alias, before returning that name from resolveEncodingAlias().

From what I can tell, it seems like the current implementation is backwards - first it should look up the alias, and then make a call to check whether that alias is supported.

But a quick check on my system (JRE 1.5, Mac OS X 10.4.7) says that GB18030 is supported, so I'm guessing that's not your problem.
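
To make the suggested order concrete (alias lookup first, then the support check), here is a minimal sketch. It is not the actual Nutch StringUtil code: the class name EncodingAliasSketch, the ALIASES table, and the fallback to the original name are assumptions made for illustration. Only Charset.isSupported() is a real JDK call.

```java
import java.nio.charset.Charset;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical sketch; not the actual Nutch StringUtil implementation. */
final class EncodingAliasSketch {
    // Assumed alias table: declared name (upper case) -> charset to use instead.
    private static final Map<String, String> ALIASES = new HashMap<>();
    static {
        ALIASES.put("GB2312", "GB18030");
        ALIASES.put("BIG5", "Big5-HKSCS");
    }

    /** True if this JVM supports the name; false for unsupported or malformed names. */
    private static boolean isSupported(String name) {
        try {
            return Charset.isSupported(name);
        } catch (IllegalArgumentException e) {
            // Covers IllegalCharsetNameException, which is a subclass.
            return false;
        }
    }

    /**
     * Looks up the alias first and only then checks that the alias is supported.
     * Falls back to the original name if the alias is unusable, and returns
     * null if neither name is supported.
     */
    static String resolveEncodingAlias(String encoding) {
        if (encoding == null) {
            return null;
        }
        String name = encoding.trim();
        String alias = ALIASES.get(name.toUpperCase(Locale.ROOT));
        if (alias != null && isSupported(alias)) {
            return alias;
        }
        return isSupported(name) ? name : null;
    }

    public static void main(String[] args) {
        // Output depends on which charsets the local JVM ships with.
        System.out.println(resolveEncodingAlias("gb2312"));
        System.out.println(resolveEncodingAlias("big5"));
    }
}
```

With this ordering, an alias that the local JVM cannot decode is never returned to the caller, which is the check described above.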


-- Ken

Ken Krugler wrote:

Thanks for your reply.

I have found that the method you mentioned looks at the HTTP header from
the web server.  It looks for "charset" and does the mapping.  The Apache web
server that serves the document is already configured with:

AddDefaultCharset Big5-HKSCS

The crawl engine does treat the encoding of all pages from the web server as
Big5-HKSCS.
But the crawl engine also looks into the meta tag of the html page.
I have two identical html pages with hong kong big5 characters. One has the
tag

<meta http-equiv="Content-Type" content="text/html; charset=Big5" />

The other

<meta http-equiv="Content-Type" content="text/html; charset=Big5-HKSCS" />

When both of these html pages are in the search result page, the "summary"
of the first one contains unreadable characters.
So I think I need to modify the code that reads the meta tag of the html
page.

Do you have any idea?


  From a quick look at the source, this eventually also calls
 StringUtil.resolveEncodingAlias().

 HtmlParser.getParse() calls StringUtil.parseCharacterEncoding(),
 passing it the content-type meta data, and then takes the returned
 charset name and calls StringUtil.resolveEncodingAlias().

 So if you fix StringUtil.resolveEncodingAlias(), I think it will take
 care of both issues (HTTP server and HTML pages).
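
The first step in that chain, pulling the charset name out of a Content-Type value, might look roughly like the sketch below. This is an illustration only, not the Nutch source: the class name CharsetFromContentType is made up, and the real StringUtil.parseCharacterEncoding() may handle more cases.

```java
import java.util.Locale;

/** Hypothetical sketch; the real Nutch method may differ in its details. */
final class CharsetFromContentType {
    private static final String KEY = "charset=";

    /** Returns the charset parameter of a Content-Type value, or null if absent. */
    static String parseCharacterEncoding(String contentType) {
        if (contentType == null) {
            return null;
        }
        // Search case-insensitively, since "Charset=" and "CHARSET=" also occur.
        int start = contentType.toLowerCase(Locale.ROOT).indexOf(KEY);
        if (start < 0) {
            return null;
        }
        String value = contentType.substring(start + KEY.length());
        // Drop any parameters that follow the charset.
        int end = value.indexOf(';');
        if (end >= 0) {
            value = value.substring(0, end);
        }
        value = value.trim();
        // Strip one pair of surrounding double quotes, if present.
        if (value.length() > 1 && value.startsWith("\"") && value.endsWith("\"")) {
            value = value.substring(1, value.length() - 1);
        }
        return value.isEmpty() ? null : value;
    }

    public static void main(String[] args) {
        System.out.println(parseCharacterEncoding("text/html; charset=Big5"));
    }
}
```

The name returned here is what would then be handed to the alias lookup, so the same alias fix applies whether the value came from the HTTP header or from the meta tag.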

 -- Ken

-----Original Message-----

I want to crawl a document with charset="big5-hkscs" (which is an
extension of big5, with extra Hong Kong Chinese characters).  But the
document's meta tags set content="text/html; charset=big5" instead.  So the
crawl engine treats the document as "big5" instead of "big5-hkscs".  That
makes the extra Hong Kong characters unreadable on the search result page.

My question is:  Can I force the crawl engine to treat the document as
"big5-hkscs"?


I don't know of a way to do this without some coding.

You could modify the resolveEncodingAlias method to add (or
uncomment) the aliasing of big5 => big5-hkscs, but then you'd have to
rebuild Nutch.

See the resolveEncodingAlias() method here:

http://www.krugle.com/files/svn/svn.apache.org/lucene/nutch/trunk/src/java/org/apache/nutch/util/StringUtil.java



 --
 Ken Krugler
 Krugle, Inc.
 +1 530-210-6378
 "Find Code, Find Answers"




