Hi,

I am using the latest nightly build of Nutch (0.9-dev) with the default configuration.
I'm indexing some sites, some of which use ISO-8859-15 encoding.
However, the characters of the cached pages are not displayed correctly in the cache view served from Tomcat 6.0.18.
I then dumped the segment directly with the command:
"bin/nutch readseg -dump crawl/segments/20090204054217 ./mydumpdir"
and checked the file mydumpdir/dump.
All accented characters (such as 'ü') are stored correctly as UTF-8 data in the "ParseText" section, but in the "Content" section they appear as "0xEFBFBD", which is the UTF-8 encoding of the replacement character U+FFFD rather than the original character. It seems Nutch doesn't store characters outside the ASCII range (0x00-0x7F) correctly in the "Content" section. If any of you have experience with this issue, I would be glad of your help.
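
For reference, the same byte pattern can be reproduced outside Nutch with a small Java sketch. It assumes (this is only a guess, not something I have confirmed in the Nutch code) that the raw ISO-8859-15 bytes get decoded as UTF-8 somewhere on the way to the dump. The class and method names are made up for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ReplacementCharDemo {
    // Render bytes as upper-case hex, e.g. {0xC3, 0xBC} -> "C3BC".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Assumed failure mode: ISO-8859-15 bytes decoded as UTF-8.
    // Bytes that are not valid UTF-8 become U+FFFD, which is
    // written back out as EF BF BD.
    static String decodeAsUtf8(byte[] raw) {
        String s = new String(raw, StandardCharsets.UTF_8);
        return hex(s.getBytes(StandardCharsets.UTF_8));
    }

    // Expected behaviour: decode with the page's real charset first,
    // then re-encode as UTF-8.
    static String decodeAsLatin9(byte[] raw) {
        String s = new String(raw, Charset.forName("ISO-8859-15"));
        return hex(s.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        byte[] raw = { (byte) 0xFC }; // 'ü' in ISO-8859-15
        System.out.println(decodeAsUtf8(raw));   // prints EFBFBD
        System.out.println(decodeAsLatin9(raw)); // prints C3BC
    }
}
```

The first line of output matches what I see in "Content", and the second matches what I see in "ParseText".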

Thanks in advance.
--
Justin Yao
