The cached version shows "éÙÑ¢–Ñ¢ÈÌËÈ"; the sign "Ñ¢" should look like the
English character "i".

If you send me a screenshot of exactly what it should look like, I can verify that it's being displayed properly with my browser.

If you do this, it's probably best to send the image to me directly, versus posting it to the entire list.

I checked the meta tags in this page; it does not have a charset in the head tag.

By "this page" you mean the original page, right? The Nutch-generated search result page has the UTF-8 charset specified.

But the default charset in Nutch is 1251.

If you mean the parser.character.encoding.default property is set to "windows-1251", I believe this is only used by the HTML parser when a fetched page doesn't have any explicit charset information. I don't think it has anything to do with the encoding of pages generated by Nutch.
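To illustrate the fallback behavior I'm describing, here is a minimal sketch in Python. It is not Nutch's actual parser code; the function name detect_charset and the regular expression are hypothetical, and it only assumes that a default encoding is used when the page declares none.

```python
import re

# Hypothetical helper: looks for a charset declared in a <meta> tag.
# Illustration only; this is NOT how Nutch's HTML parser is implemented.
META_CHARSET = re.compile(rb'<meta[^>]+charset\s*=\s*["\']?([\w-]+)', re.IGNORECASE)


def detect_charset(raw: bytes, default: str = "windows-1251") -> str:
    """Return the charset declared in the page, or the default if none is declared."""
    match = META_CHARSET.search(raw)
    if match:
        return match.group(1).decode("ascii").lower()
    # No explicit charset information: fall back to the configured default,
    # analogous to what parser.character.encoding.default is believed to do.
    return default


page_with_charset = (
    b'<html><head><meta http-equiv="Content-Type" '
    b'content="text/html; charset=utf-8"></head></html>'
)
page_without_charset = b"<html><head><title>test</title></head></html>"

print(detect_charset(page_with_charset))
print(detect_charset(page_without_charset))
```

Under this assumption, the default only matters when fetched pages are read, not when result pages are written.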

When the head tag has charset windows-1251,
Ukrainian is fine :)

I'm not sure what you mean by this...were you able to force Nutch to generate pages using the 1251 character encoding?

-- Ken


-----Original Message-----
From: Ken Krugler [mailto:[EMAIL PROTECTED]
Sent: Monday, July 25, 2005 10:18 PM
To: [email protected]
Subject: Re: html parsers and windows-1251 (ukrainian)

I see incorrect characters (pseudo-graphics) in place of the characters that
are not present in Russian, in summaries for Ukrainian 1251.

With Russian, everything in the summary is fine.

For example, the cached version:
http://search.kvitka.info/cached.jsp?idx=0&id=48679

At the top you can find the original document URL and see the difference :)

How can I fix that? Can anybody help me with this issue?

Both pages look OK to me, though I don't read Ukrainian - sorry :)

I'm running Mac OS X 10.3 w/the Safari browser. I can send you the
screenshot of the summary if you'd like.

When I looked at the source of the original page
(http://www.prosvita.kiev.ua/posyl_m.htm) I didn't see any charset
specified. I'm guessing it should have an explicit CP 1251 in there,
versus forcing browsers to guess.

In the summary page generated by Nutch
(http://search.kvitka.info/cached.jsp?idx=0&id=48679) it explicitly
specifies UTF-8.

So my guess is that your browser either can't handle UTF-8, or you've
got it configured to assume CP 1251.
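As a rough illustration of that kind of mismatch, the Python sketch below decodes the bytes of the Ukrainian letter "і" with the wrong charset. The choice of this one letter is just an example; it does not claim to reproduce the exact garbled text reported in this thread.

```python
# The Ukrainian letter "і" (U+0456), written as an escape so the source
# file's own encoding cannot interfere with the demonstration.
letter = "\u0456"

utf8_bytes = letter.encode("utf-8")      # two bytes in UTF-8
cp1251_bytes = letter.encode("cp1251")   # one byte in windows-1251

# A browser that assumes CP 1251 while the page is really UTF-8
# turns the single letter into two unrelated characters.
utf8_read_as_cp1251 = utf8_bytes.decode("cp1251")

# The reverse mistake: a lone CP 1251 byte is not valid UTF-8,
# so it becomes the replacement character.
cp1251_read_as_utf8 = cp1251_bytes.decode("utf-8", errors="replace")

print(utf8_bytes, cp1251_bytes)
print(repr(utf8_read_as_cp1251), repr(cp1251_read_as_utf8))
```

Either direction of mismatch damages the text, which is why an explicit and correct charset declaration matters.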

-- Ken
--
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-470-9200


