>Now I set the jsp pages to utf-8 encoding, but the result is still the same.
>Do you have any other idea? There's another problem.
>If I'm writing non-english characters in the input textbox, after clicking
>on the search the content of the text-box will be false, e.g. the text is
>not displayed correctly.

Sorry - at this point it's a .jsp container issue. I've done battle with Resin in the past (and lost), so I don't know that I can provide much help here.

The response is definitely coming back as UTF-8, and the Hungarian characters are now showing up as ef bf bd sequences (the UTF-8 byte sequence for U+FFFD, the Unicode replacement character), which is what you would expect if the conversion to UTF-8 can't be done.

It's almost like your .jsp page is treating the source text (from Nutch) as ASCII, so anything with a code point > 127 is treated as unknown and therefore unconvertible.

-- Ken

>Ken Krugler ([EMAIL PROTECTED]) wrote:
>>
>> Hi Zsolt,
>>
>> >Here is the cache view :
>> >
>> >http://64.34.163.57:8080/nutch-0.9/cached.jsp?idx=0&id=0
>>
>> When I hit this with curl, I see that it's
>> returning Content-Type:
>> text/html;charset=iso-8859-2 in the response
>> header, and the content has <meta
>> http-equiv="Content-Type" content="text/html;
>> charset=iso-8859-2">.
>>
>> But I see that the base href is:
>>
>> <base href="http://www.daganatok.hu/">
>>
>> And when I hit that URL, I get back:
>>
>> < Content-Type: text/html; charset=utf-8
>> <meta http-equiv="Content-Type"
>> content="text/html; charset=utf-8"
>> />
>>
>> The data seems to be valid UTF-8, and from my
>> experience Nutch works correctly with correctly
>> identified UTF-8 web pages.
>>
>> So I'm guessing the '?' come about when your
>> webapp container/server tries to convert the
>> UTF-8 data to 8859-2.
>>
>> -- Ken
>>
>> >Ken Krugler ([EMAIL PROTECTED]) wrote:
>> >>
>> >> >Hi All,
>> >> >
>> >> >I would like to share an issue regarding the encoding
>> >> >using Nutch 0.9.x.
>> >> >
>> >> >When I'm indexing some sites, which contains lot of
>> >> >ISO-8859-2 characters, (these are mainly eastern-european
>> >> >sites, mainly hungarian ones) then at the search page
>> >> >I cannot see the characters correctly.
>> >> >Even at the cached
>> >> >view, the non-english characters like áéúő are visible
>> >> >as a question mark.
>> >> >
>> >> >If some of you, have an experience with this issue,
>> >> >I would be glad when some of You can help me.
>> >>
>> >> What's the URL of an example page with this type of problem?
>> >>
>> >> -- Ken
>>
>> --
>> Ken Krugler
>> Krugle, Inc.
>> +1 530-210-6378
>> "Find Code, Find Answers"

--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"Find Code, Find Answers"

_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
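
For illustration, the two symptoms discussed in this thread (question marks when the response was sent as ISO-8859-2, and ef bf bd byte sequences once it was sent as UTF-8) can be reproduced in isolation. The sketch below is in Python rather than JSP and does not touch Nutch or Resin. The helper names `misread_as_ascii` and `hex_dump` are made up for this example, and the idea that some layer decodes the UTF-8 bytes as ASCII is only the guess voiced in the thread, not a confirmed diagnosis.

```python
# Illustration only: a stand-alone reproduction of the byte patterns
# discussed in the thread. Nothing here calls Nutch, Resin, or a JSP API.


def misread_as_ascii(utf8_bytes: bytes) -> str:
    """Decode bytes as ASCII (hypothetical faulty step).

    Every byte above 0x7f is undecodable in ASCII, so with
    errors="replace" each one becomes U+FFFD, the replacement character.
    """
    return utf8_bytes.decode("ascii", errors="replace")


def hex_dump(data: bytes) -> str:
    """Format bytes as space-separated lowercase hex pairs."""
    return " ".join(f"{b:02x}" for b in data)


# The sample characters from the original report: a-acute, e-acute,
# u-acute, o-double-acute (written as escapes to avoid source-encoding issues).
sample = "\u00e1\u00e9\u00fa\u0151"

# Correct UTF-8 encoding: two bytes per character.
utf8_bytes = sample.encode("utf-8")

# Assumed faulty step: the UTF-8 bytes are interpreted as ASCII.
mangled = misread_as_ascii(utf8_bytes)

# Sending the mangled text as UTF-8 produces ef bf bd for each lost byte.
as_utf8 = mangled.encode("utf-8")

# Sending it as ISO-8859-2 instead produces '?', since U+FFFD has no
# mapping in that charset.
as_latin2 = mangled.encode("iso-8859-2", errors="replace")

print(hex_dump(utf8_bytes))
print(hex_dump(as_utf8))
print(as_latin2)
```

Note that all four sample characters exist in ISO-8859-2, so a correct UTF-8 to ISO-8859-2 conversion would not lose them; under this assumption the damage happens at the decoding step, before any re-encoding.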
