I have now set the JSP pages to UTF-8 encoding, but the result is still the same.
Do you have any other ideas? There is also another problem:
if I type non-English characters into the input text box, then after clicking
Search the content of the text box is wrong, i.e. the text is
not displayed correctly.

Sorry - at this point it's a .jsp container issue. I've done battle with Resin in the past (and lost), so I don't know that I can provide much help here.

The response is definitely coming back as UTF-8, and the Hungarian characters are now showing up as ef bf bd sequences (the UTF-8 byte sequence for U+FFFD, the Unicode replacement character), which is what you would expect if the conversion to UTF-8 can't be done.

It's almost like your .jsp page is treating the source text (from Nutch) as ASCII, and thus anything with a code point > 127 is being treated as unknown and thus unconvertible.
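To illustrate the point about ASCII, here is a minimal Java sketch of how ef bf bd can arise. It assumes the bytes are decoded as US-ASCII somewhere before being written out as UTF-8; the class and method names are made up for illustration and are not taken from Nutch or Resin.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ReplacementCharDemo {
    // Render bytes as lowercase hex pairs separated by spaces.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    // Decode the source bytes with a (possibly wrong) charset,
    // then re-encode the resulting string as UTF-8.
    static byte[] misdecodeThenEncode(byte[] source, Charset assumed) {
        String decoded = new String(source, assumed);
        return decoded.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // U+0151 (o with double acute) as valid UTF-8 is the two bytes c5 91.
        byte[] source = "\u0151".getBytes(StandardCharsets.UTF_8);

        // Both bytes are above 127, so a US-ASCII decoder replaces each
        // of them with U+FFFD, which UTF-8 encodes as ef bf bd.
        byte[] out = misdecodeThenEncode(source, StandardCharsets.US_ASCII);
        System.out.println(hex(out)); // prints "ef bf bd ef bf bd"
    }
}
```

If the real pipeline behaves like this, one two-byte Hungarian letter would show up as two replacement characters in the output.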

-- Ken


Ken Krugler ([EMAIL PROTECTED]) wrote:

 Hi Zsolt,

 >Here is the cache view:
 >http://64.34.163.57:8080/nutch-0.9/cached.jsp?idx=0&id=0

 When I hit this with curl, I see that it's returning
 Content-Type: text/html;charset=iso-8859-2 in the response header,
 and the content has
 <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-2">.

 But I see that the base href is:

 <base href="http://www.daganatok.hu/">

 And when I hit that URL, I get back:

 < Content-Type: text/html; charset=utf-8
 <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

 The data seems to be valid UTF-8, and from my
 experience Nutch works correctly with correctly
 identified UTF-8 web pages.
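The mismatch between the two responses can also be checked programmatically. The following is only a sketch: `charsetOf` is a hypothetical helper, not part of Nutch or any servlet API, and it handles just the simple `type; charset=value` form seen in the headers quoted in this message.

```java
import java.util.Locale;

class ContentTypeCharset {
    // Return the charset parameter of a Content-Type value in lowercase,
    // or null if the value is null or carries no charset parameter.
    static String charsetOf(String contentType) {
        if (contentType == null) {
            return null;
        }
        String[] parts = contentType.split(";");
        // parts[0] is the media type; parameters start at index 1.
        for (int i = 1; i < parts.length; i++) {
            String param = parts[i].trim();
            if (param.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String value = param.substring("charset=".length()).trim();
                // Strip optional surrounding double quotes.
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                return value.toLowerCase(Locale.ROOT);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Header values as reported by curl in this message.
        String cachedView = "text/html;charset=iso-8859-2";
        String originalPage = "text/html; charset=utf-8";

        String a = charsetOf(cachedView);
        String b = charsetOf(originalPage);
        System.out.println(a + " vs " + b + ", match: " + a.equals(b));
    }
}
```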

 So I'm guessing the '?' characters come about when your
 webapp container/server tries to convert the
 UTF-8 data to 8859-2.

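As a rough illustration of where a literal '?' can come from: in Java, `String.getBytes(Charset)` substitutes the charset's replacement byte (a '?' for the ISO-8859 family) for any character the target charset cannot represent. The sketch below only demonstrates that mechanism; the class and method names are made up, and which charset the container actually uses is an assumption.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class QuestionMarkDemo {
    // Encode the text into the target charset and decode it again with the
    // same charset, so that only losses from the encoding step remain visible.
    static String roundTrip(String text, Charset target) {
        return new String(text.getBytes(target), target);
    }

    public static void main(String[] args) {
        String hungarian = "\u0151"; // o with double acute, U+0151
        Charset latin2 = Charset.forName("ISO-8859-2");

        // ISO-8859-2 contains U+0151, so the character survives.
        System.out.println(roundTrip(hungarian, latin2).equals(hungarian));

        // ISO-8859-1 does not contain U+0151, so the encoder emits '?'.
        System.out.println(roundTrip(hungarian, StandardCharsets.ISO_8859_1).equals("?"));

        // The euro sign is not in ISO-8859-2 either and also becomes '?'.
        System.out.println(roundTrip("\u20ac", latin2).equals("?"));
    }
}
```

Note that á, é, ú and ő all exist in ISO-8859-2, so a clean UTF-8 to 8859-2 conversion would keep them. If they still come out as '?', a narrower charset or a wrong decoding step may be involved somewhere in the chain.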
 -- Ken

 >Ken Krugler ([EMAIL PROTECTED]) wrote:
 >>
 >>  >Hi All,
 >>  >
 >>  >I would like to share an issue regarding the encoding
 >>  >using Nutch 0.9.x.
 >>  >
 >>  >When I'm indexing some sites which contain a lot of
 >>  >ISO-8859-2 characters (these are mainly Eastern European
 >>  >sites, mostly Hungarian ones), then on the search page
 >>  >I cannot see the characters correctly. Even in the cached
 >>  >view, non-English characters like áéúő are shown
 >>  >as question marks.
 >>  >
 >>  >If any of you have experience with this issue,
 >>  >I would be glad if you could help me.
 >>
 >>  What's the URL of an example page with this type of problem?
 >>
 >> -- Ken

 --
 Ken Krugler
 Krugle, Inc.
 +1 530-210-6378
 "Find Code, Find Answers"



