bad encoding for non-ASCII chars in cached page

Justin Yao Tue, 10 Feb 2009 16:43:58 -0800

Hi,

I am using the latest nightly build of Nutch 0.9-dev with the default configuration.

I'm indexing some sites, some of which use ISO-8859-15 encoding.

However, the characters in the cached page are not displayed correctly in the cache view served from Tomcat 6.0.18.

Then I dumped that segment directly using the command:
"bin/nutch readseg -dump  crawl/segments/20090204054217  ./mydumpdir"
and checked the file mydumpdir/dump.

All accented characters (such as 'ü') are stored correctly as UTF-8 data in the "ParseText" section, but they are stored as invalid UTF-8 data in the "Content" section as "0xEFBFBD".

It seems Nutch doesn't store characters beyond the ASCII range (0x00-0x7f) correctly in the "Content" section.

If any of you have experience with this issue, I would be glad if you could help me.
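
The byte sequence 0xEFBFBD is the UTF-8 encoding of U+FFFD, the Unicode replacement character. That would be consistent with the raw ISO-8859-15 bytes being decoded as UTF-8 at some point before the dump is written, although the post does not confirm where this happens. The following is a minimal Java sketch of that effect only, not Nutch code; the class name `EncodingDemo` and the method `reencodeAsUtf8Hex` are hypothetical names made up for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingDemo {

    // Hypothetical helper: decode raw bytes with the given charset,
    // re-encode the resulting string as UTF-8, and return the bytes as hex.
    static String reencodeAsUtf8Hex(byte[] raw, Charset decodeWith) {
        // new String(...) replaces malformed input with U+FFFD
        String text = new String(raw, decodeWith);
        StringBuilder hex = new StringBuilder();
        for (byte b : text.getBytes(StandardCharsets.UTF_8)) {
            hex.append(String.format("%02X", b & 0xFF));
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        // 'ü' is the single byte 0xFC in ISO-8859-15
        byte[] raw = { (byte) 0xFC };

        // Decoded with the wrong charset: 0xFC is not valid UTF-8
        System.out.println(reencodeAsUtf8Hex(raw, StandardCharsets.UTF_8));

        // Decoded with the page's actual charset
        System.out.println(reencodeAsUtf8Hex(raw, Charset.forName("ISO-8859-15")));
    }
}
```

The first call yields the replacement character seen in the "Content" section, while the second yields the correct UTF-8 bytes for 'ü' (0xC3BC), matching what the "ParseText" section shows.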


Thanks in advance.
--
Justin Yao
