I have the same problem with the cached copy when crawling Vietnamese pages 
that use the utf-8 charset. I have dug into the Nutch configuration but have 
no idea how to solve it.

By the way, does anyone know how to force the crawler not to cache (that is, 
not to put the cached data into the DB)?

Here is my search 
(http://203.162.71.66:8080/search.jsp?query=%22qu%E1%BA%A3n+l%C3%BD%22&hitsPerPage=10&lang=en)
and its cached copy (http://203.162.71.66:8080/cached.jsp?idx=0&id=37)

What should I do? :(

Best regards

-----Original Message-----
From: xu xiong [mailto:[EMAIL PROTECTED] 
Sent: 07 June 2007 9:22 AM
To: [email protected]
Subject: ParseData encoding problem

Hi,

I use Nutch 0.9 to crawl some Chinese web sites and search them through the
Nutch web portal, but the cached HTML copy displays incorrectly.
I then used "bin/nutch readseg -dump" to dump the segments:
ParseText (UTF-8) displays correctly, but the Chinese characters in
Content display incorrectly as '?'. The original HTML uses the gb2312
charset.

What's the possible cause? And how can I fix it?
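
For illustration only, here is a minimal Python sketch of how a '?' symptom
like the one described can arise in general. It is not taken from Nutch, and
it is only an assumption that something similar happens to Content when it is
dumped or served. The sample text and the choice of ASCII and Latin-1 as the
"wrong" charsets are hypothetical.

```python
# Illustration only: the sample text and the "wrong" charsets are assumptions,
# not anything taken from Nutch itself.
text = "\u4e2d\u6587\u7f51\u9875"  # four Chinese characters ("Chinese web page")

# Bytes as a page declared with the gb2312 charset would carry them.
raw = text.encode("gb2312")

# Decoding with the charset the page actually uses recovers the text.
decoded_ok = raw.decode("gb2312")

# Decoding the same bytes with a wrong charset yields mojibake instead.
decoded_wrong = raw.decode("latin-1")

# Writing the text out in a charset that cannot represent it replaces
# every unmappable character with '?'.
as_ascii = text.encode("ascii", errors="replace")

print(decoded_ok == text)     # round trip with the right charset
print(decoded_wrong == text)  # wrong charset does not recover the text
print(as_ascii)
```

If the raw bytes still decode correctly as gb2312, the stored data is intact
and only the charset used for display or output needs to change. If the '?'
bytes are already in the stored data, the original characters cannot be
recovered from it.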

Thanks in advance,
Xiong
