Hi Zsolt,
Here is the cached view:
http://64.34.163.57:8080/nutch-0.9/cached.jsp?idx=0&id=0
When I hit this with curl, I see that it's returning
Content-Type: text/html;charset=iso-8859-2
in the response header, and the content has
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-2">.
But I see that the base href is:
<base href="http://www.daganatok.hu/">
And when I hit that URL, I get back:
< Content-Type: text/html; charset=utf-8
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
The data seems to be valid UTF-8, and in my
experience Nutch handles properly identified
UTF-8 web pages correctly.
So I'm guessing the '?' characters come about when
your webapp container/server tries to convert the
UTF-8 data to 8859-2.
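
As a rough illustration of how a '?' can appear (a minimal sketch,
not a diagnosis of your setup; the class and method names are made
up for this example): Java's String.getBytes(Charset) substitutes
the charset's replacement byte, which is '?' for the charsets below,
for any character the target charset cannot represent.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingDemo {
    // Encode the text in the given charset, then decode the resulting
    // bytes with the same charset. Characters the charset cannot
    // represent are replaced during the encode step.
    static String roundTrip(String text, Charset cs) {
        return new String(text.getBytes(cs), cs);
    }

    public static void main(String[] args) {
        // The sample characters from the original report: á é ú ő
        String sample = "\u00e1\u00e9\u00fa\u0151";
        Charset latin2 = Charset.forName("ISO-8859-2");

        // Console rendering depends on the terminal's own encoding,
        // so compare the returned strings rather than trusting the eye.
        System.out.println(roundTrip(sample, StandardCharsets.UTF_8));
        System.out.println(roundTrip(sample, latin2));
        System.out.println(roundTrip(sample, StandardCharsets.ISO_8859_1));
        System.out.println(roundTrip(sample, StandardCharsets.US_ASCII));
    }
}
```

All four sample characters exist in ISO-8859-2, so that round trip
keeps them intact. ISO-8859-1 has no ő and turns it into '?', and
US-ASCII loses all four. Which charset a given container actually
uses at each step is not something this sketch can tell you.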
-- Ken
Ken Krugler ([EMAIL PROTECTED]) wrote:
>Hi All,
>
>I would like to share an issue regarding the encoding
>using Nutch 0.9.x.
>
>When I'm indexing some sites which contain a lot of
>ISO-8859-2 characters (these are mainly Eastern European
>sites, mostly Hungarian ones), then on the search page
>I cannot see the characters correctly. Even in the cached
>view, the non-English characters like áéúő are shown
>as question marks.
>
>If any of you have experience with this issue,
>I would be glad if you could help me.
What's the URL of an example page with this type of problem?
> -- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"Find Code, Find Answers"