Hi,
I am using the latest nightly build of Nutch (0.9-dev) with the default
configuration.
I'm indexing some sites, some of which use ISO-8859-15 encoding.
However, the characters on the cached page are not displayed correctly in
the cached view served from Tomcat 6.0.18.
I then dumped that segment directly using the command:
"bin/nutch readseg -dump crawl/segments/20090204054217 ./mydumpdir"
and checked the file mydumpdir/dump.
All accented characters (such as 'ü') are stored correctly as UTF-8
data in the "ParseText" section, but in the "Content" section each of them
appears as the bytes 0xEFBFBD, which is the UTF-8 encoding of U+FFFD (the
replacement character), not the original character.
It seems Nutch doesn't store characters outside the ASCII range (0x00-0x7f)
correctly in the "Content" section.
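
To illustrate the symptom, here is a minimal Python sketch (not Nutch code). It assumes the fetched bytes are ISO-8859-15 and that somewhere they are decoded as UTF-8; I have not confirmed that this is what Nutch or the dump tool actually does. The names `raw`, `wrong`, `right` and `decode_as` are only for illustration.

```python
# Hypothetical sample: 'ü' (U+00FC) as it would arrive from an
# ISO-8859-15 page, i.e. as the single byte 0xFC.
raw = "\u00fc".encode("iso-8859-15")


def decode_as(data: bytes, charset: str) -> str:
    """Decode bytes, substituting U+FFFD for anything undecodable."""
    return data.decode(charset, errors="replace")


# Assumption: the bytes are decoded with the wrong charset (UTF-8).
wrong = decode_as(raw, "utf-8")
# Decoding with the charset the page really uses keeps the character.
right = decode_as(raw, "iso-8859-15")

print(raw.hex())                    # fc
print(wrong.encode("utf-8").hex())  # efbfbd
print(right.encode("utf-8").hex())  # c3bc
```

If this is the cause, the 0xEFBFBD bytes would come from a decoding step using the wrong charset rather than from the original page data.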
If any of you have experience with this issue, I would be glad of
your help.
Thanks in advance.
--
Justin Yao