Fwd: Fetch/Dump problem: Some Chinese characters incorrect.

[EMAIL PROTECTED] Tue, 14 Oct 2008 01:25:04 -0700

And it gets weirder when I use "readseg -get".

The Chinese text in the "parsetext" section is all correct, while the main HTML
page is completely garbled, and both differ from what I got with "readseg
-dump".

Does anybody have a clue? It seems to be a SegmentReader problem. Does it, for
some reason, use an unreliable encoding/conversion when pulling text from segments?

By the way, all the Chinese characters are in three-byte UTF-8.
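
For anyone who wants to check the encoding point outside of Nutch, here is a minimal Python sketch. It only confirms that the characters in the page title take three bytes each in UTF-8, and shows that decoding those bytes with a different charset garbles them. The choice of GBK as the "wrong" charset is purely an assumption for illustration; it is not a diagnosis of what SegmentReader actually does.

```python
# Title string taken from the page in question.
title = "韶山冲(徐汇店)(图)_上海_大众点评网"

# Bytes as they would appear in a UTF-8 encoded page.
raw = title.encode("utf-8")

# Collect the UTF-8 byte length of every CJK character in the title.
cjk = [ch for ch in title if "\u4e00" <= ch <= "\u9fff"]
byte_lengths = {len(ch.encode("utf-8")) for ch in cjk}

# Decoding with the correct charset round-trips cleanly.
round_trip = raw.decode("utf-8")

# Decoding with a wrong charset (GBK here, an assumption) garbles the text;
# any bytes that cannot be decoded are replaced with U+FFFD.
garbled = raw.decode("gbk", errors="replace")

print(byte_lengths)
print(round_trip == title)
print(garbled)
```

If the raw bytes stored in the segment decode cleanly as UTF-8 like this, the damage is more likely introduced when the content is read back or written out than at fetch time.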

---------- Forwarded message ----------
From: [EMAIL PROTECTED] <[EMAIL PROTECTED]>
Date: 2008/10/13
Subject: Fetch/Dump problem: Some Chinese characters incorrect.
To: [email protected]

I obtained some Chinese-language web pages via "nutch fetch", but some
Chinese characters do not come out right after I dump the segment back to
HTML pages. For instance:
http://www.dianping.com/shop/501079/
has title portion:
<head><title>
韶山冲(徐汇店)(图)_上海_大众点评网
</title>

However, I got this after dumping:
<head><title>
韶山��1¤7(徐汇庄1¤7)(��1¤7)_上海_大众点评罄1¤7
</title>

The charset specified in the page is "UTF-8". I also included the following
in "nutch-site.xml":
<name>parser.character.encoding.default</name>
  <value>UTF-8</value>

It makes no difference.

What could be the problem?
