And it gets weirder when I use "readseg -get". The Chinese text in the "parsetext" section is all correct, while the main HTML page is totally garbled, and both differ from what I got with "readseg -dump".
Does anybody have a clue? It seems to be a SegmentReader problem, which for some reason uses an unreliable encoding/conversion when pulling text from segments. By the way, all the Chinese characters are three-byte UTF-8.

---------- Forwarded message ----------
From: [EMAIL PROTECTED] <[EMAIL PROTECTED]>
Date: 2008/10/13
Subject: Fetch/Dump problem: Some Chinese characters incorrect.
To: [email protected]

I obtained some Chinese-language web pages via "nutch fetch", but some Chinese characters do not come out right after I dump the segment back to HTML pages.

For instance, http://www.dianping.com/shop/501079/ has this title portion:

<head><title> 韶山冲(徐汇店)(图)_上海_大众点评网 </title>

However, I got this after dumping:

<head><title> 韶山��1¤7(徐汇庄1¤7)(��1¤7)_上海_大众点评罄1¤7 </title>

The charset specified in the page is "UTF-8". I included the following in "nutch-site.xml", but it makes no difference:

<name>parser.character.encoding.default</name>
<value>UTF-8</value>

What could be the problem?
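To illustrate the kind of conversion problem suspected above, here is a minimal Python sketch (not Nutch code). It shows one possible mechanism, which is an assumption and not a confirmed diagnosis of SegmentReader: if a UTF-8 byte stream is decoded in fixed-size chunks, any three-byte character that straddles a chunk boundary is replaced by U+FFFD, while an incremental decoder carries the partial bytes over and keeps the text intact. The function names and the chunk size are hypothetical.

```python
import codecs

def decode_in_chunks(data: bytes, size: int) -> str:
    """Decode each chunk independently (lossy when a character is split)."""
    parts = []
    for i in range(0, len(data), size):
        # errors="replace" turns every broken sequence into U+FFFD
        parts.append(data[i:i + size].decode("utf-8", errors="replace"))
    return "".join(parts)

def decode_incrementally(data: bytes, size: int) -> str:
    """Decode the same chunks with a stateful decoder (lossless)."""
    decoder = codecs.getincrementaldecoder("utf-8")(errors="strict")
    parts = []
    for i in range(0, len(data), size):
        # partial multi-byte sequences are buffered until the next chunk
        parts.append(decoder.decode(data[i:i + size]))
    parts.append(decoder.decode(b"", final=True))
    return "".join(parts)

def first_invalid_utf8_offset(data: bytes):
    """Return the byte offset of the first invalid sequence, or None."""
    try:
        data.decode("utf-8")
        return None
    except UnicodeDecodeError as err:
        return err.start

title = "韶山冲(徐汇店)(图)_上海_大众点评网"
raw = title.encode("utf-8")  # each Chinese character here takes three bytes

damaged = decode_in_chunks(raw, 4)       # 4 is an arbitrary chunk size
intact = decode_incrementally(raw, 4)
print(damaged == title, intact == title)
```

Running `first_invalid_utf8_offset` on the raw bytes of a dumped file would show whether the bytes on disk are already invalid UTF-8 or whether the damage only appears when the file is displayed.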
