>I have now set the JSP pages to UTF-8 encoding, but the result is still the same.
>Do you have any other ideas? There's another problem, too.
>If I type non-English characters into the input text box, then after clicking
>Search the content of the text box is wrong, i.e. the text is
>not displayed correctly.

Sorry - at this point it's a .jsp container 
issue. I've done battle with Resin in the past 
(and lost), so I don't know that I can provide 
much help here.

The response is definitely coming back as UTF-8, 
and the Hungarian characters are now showing up 
as ef bf bd sequences (the UTF-8 byte sequence 
for U+FFFD, the Unicode replacement character), 
which is what you would expect if the conversion 
to UTF-8 can't be done.

It's almost like your .jsp page is treating the 
source text (from Nutch) as ASCII, and thus 
anything with a code point > 127 is being treated 
as unknown and thus unconvertible.
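
To illustrate what I mean, here is a minimal sketch in plain Java. It is not 
taken from Nutch or Resin; the class and method names are made up for the 
example. It decodes the UTF-8 bytes of a Hungarian character as US-ASCII and 
then writes the result back out as UTF-8, which I'm assuming is roughly what 
happens somewhere in the .jsp layer.

```java
import java.nio.charset.StandardCharsets;

public class ReplacementCharDemo {

    // Hypothetical helper: decode bytes with the wrong charset (US-ASCII),
    // then re-encode the resulting string as UTF-8.
    static byte[] misdecodeAsAsciiThenEncodeUtf8(byte[] utf8Bytes) {
        // Every byte above 127 is invalid in US-ASCII, so the decoder
        // substitutes U+FFFD (the replacement character) for each one.
        String misdecoded = new String(utf8Bytes, StandardCharsets.US_ASCII);
        return misdecoded.getBytes(StandardCharsets.UTF_8);
    }

    // Format bytes as lowercase hex pairs separated by spaces.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+00E1 (a with acute) is the two bytes c3 a1 in UTF-8.
        byte[] original = "\u00e1".getBytes(StandardCharsets.UTF_8);
        System.out.println(toHex(original));
        // Each of the two bytes becomes U+FFFD, i.e. ef bf bd in UTF-8.
        System.out.println(toHex(misdecodeAsAsciiThenEncodeUtf8(original)));
    }
}
```

If the bytes in your response look like the second line of output, the 
damage happened while decoding the source text, not while writing the 
response.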

-- Ken


>Ken Krugler ([EMAIL PROTECTED]) wrote:
>>
>>  Hi Zsolt,
>>
>>  >Here is the cache view :
>>  >http://64.34.163.57:8080/nutch-0.9/cached.jsp?idx=0&id=0
>>
>>  When I hit this with curl, I see that it's
>>  returning Content-Type:
>>  text/html;charset=iso-8859-2 in the response
>>  header, and the content has <meta
>>  http-equiv="Content-Type" content="text/html;
>>  charset=iso-8859-2">.
>>
>>  But I see that the base href is:
>>
>>  <base href="http://www.daganatok.hu/">
>>
>>  And when I hit that URL, I get back:
>>
>>  < Content-Type: text/html; charset=utf-8
>>           <meta http-equiv="Content-Type"
>>  content="text/html; charset=utf-8" />
>>
>>  The data seems to be valid UTF-8, and from my
>>  experience Nutch works correctly with correctly
>>  identified UTF-8 web pages.
>>
>>  So I'm guessing the '?' characters come about when your
>>  webapp container/server tries to convert the
>>  UTF-8 data to 8859-2.
>>
>>  -- Ken
>>
>>  >Ken Krugler ([EMAIL PROTECTED]) wrote:
>>  >>
>>  >>  >Hi All,
>>  >>  >
>>  >>  >I would like to share an issue regarding the encoding
>>  >>  >using Nutch 0.9.x.
>>  >>  >
>>  >>  >When I index some sites that contain a lot of
>>  >>  >ISO-8859-2 characters (these are mainly Eastern European
>>  >>  >sites, mostly Hungarian ones), I cannot see the characters
>>  >>  >correctly on the search page. Even in the cached
>>  >>  >view, non-English characters like áéúő are shown
>>  >>  >as question marks.
>>  >>  >
>>  >>  >If any of you have experience with this issue,
>>  >>  >I would be glad of your help.
>>  >>
>>  >>  What's the URL of an example page with this type of problem?
>>  >>
>>  >>  -- Ken
>>
>>  --
>>  Ken Krugler
>>  Krugle, Inc.
>>  +1 530-210-6378
>>  "Find Code, Find Answers"
>>
>>


-- 
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"Find Code, Find Answers"
