Hello

By "this page" you mean the original page, right?

Yes, you are correct: the original page does not have any charset information.

If you mean the parser.character.encoding.default
property is set to "windows-1251",

Yes, I mean the "parser.character.encoding.default" property in nutch-site.xml.

 >I'm not sure what you mean by this...were you
able to force Nutch to generate pages using the
1251 character encoding?

I have other pages in Ukrainian. If a page has charset info in its head tag,
all non-Russian characters seem to display correctly.

I think I see the problem.

Your original web page is missing charset info, _and_ the correct charset to specify is "KOI8-U", not "windows-1251".

When Nutch analyzes the page, it's going to assume 1251 because of the parser.character.encoding.default property value, and thus its conversion to UTF-8 will be wrong for the specific character that you mention.

So then when Nutch's summary page is generated (and correctly tagged as UTF-8), you'll see an incorrect character.
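
A minimal Python sketch of the decode-then-re-encode step described above, for illustration only. It is not Nutch's code: the helper name convert_to_utf8 is hypothetical, the letter "ї" is just an example of a Ukrainian character (the thread does not say which one was affected), and the page is assumed to really be KOI8-U.

```python
def convert_to_utf8(raw: bytes, assumed_charset: str) -> bytes:
    """Decode raw page bytes with an assumed charset, then re-encode as UTF-8."""
    return raw.decode(assumed_charset).encode("utf-8")


# "ї" is a Ukrainian letter that does not exist in Russian (example only).
original = "ї"
page_bytes = original.encode("koi8_u")  # assumption: the page is really KOI8-U

# Conversion with the configured default (windows-1251) versus the real charset.
wrong = convert_to_utf8(page_bytes, "cp1251").decode("utf-8")
right = convert_to_utf8(page_bytes, "koi8_u").decode("utf-8")

print(repr(original), repr(wrong), repr(right))
```

The output of the wrong conversion is still valid UTF-8, so tagging the summary page as UTF-8 is correct; the damage happens earlier, when the bytes are decoded with the wrong assumed charset.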

-- Ken


 >-----Original Message-----
From: Ken Krugler [mailto:[EMAIL PROTECTED]
Sent: Monday, July 25, 2005 10:18 PM
To: [email protected]
Subject: Re: html parsers and windows-1251 (ukrainian)

In summaries for Ukrainian 1251 pages I see incorrect characters: pseudo-graphics
in place of the characters that are not present in Russian.

With Russian, the summaries are all fine.

For example, the cached version:
http://search.kvitka.info/cached.jsp?idx=0&id=48679

At the top you can find the original document URL and see the difference :)

How can I fix that? Can anybody help me with this issue?

Both pages look OK to me, though I don't read Ukrainian - sorry :)

I'm running Mac OS X 10.3 w/the Safari browser. I can send you the
screenshot of the summary if you'd like.

When I looked at the source of the original page
(http://www.prosvita.kiev.ua/posyl_m.htm) I didn't see any charset
specified. I'm guessing it should have an explicit CP 1251 in there,
versus forcing browsers to guess.
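
One rough way to check whether a page declares a charset in its markup is sketched below in Python. This is an assumption-laden illustration, not how Nutch or any browser detects encodings: the helper declared_charset and its regular expression are hypothetical, it only looks at meta tags in the first few kilobytes, and it ignores the HTTP Content-Type header.

```python
import re

# Hypothetical pattern: finds charset=... inside a <meta ...> tag.
META_CHARSET = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9_.:-]+)',
    re.IGNORECASE,
)


def declared_charset(html_bytes: bytes):
    """Return the charset named in a meta tag, lowercased, or None if absent."""
    match = META_CHARSET.search(html_bytes[:4096])
    return match.group(1).decode("ascii").lower() if match else None


with_charset = (
    b'<html><head><meta http-equiv="Content-Type" '
    b'content="text/html; charset=windows-1251"></head></html>'
)
without_charset = b"<html><head><title>t</title></head></html>"

print(declared_charset(with_charset))
print(declared_charset(without_charset))
```

When the helper returns None, a consumer has to fall back to a guess or a configured default.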

In the summary page generated by Nutch
(http://search.kvitka.info/cached.jsp?idx=0&id=48679) it explicitly
specifies UTF-8.

So my guess is that your browser either can't handle UTF-8, or you've
got it configured to assume CP 1251.

-- Ken
--
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-470-9200



Attachment converted: HD:1251.png (PNGf/«IC») (001914FD)




_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
