Can you provide the HTTP headers and the HEAD section of the HTML for a Web page on which Nutch fails? Perhaps there is an inconsistency between the HTTP and META headers, or a misspelled code page name? Just a wild guess, but believe me -- Java converts fine between Cp1250, ISO-8859-2 and its internal UTF-16, so there must be something wrong elsewhere.
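
To rule out the conversion itself, here is a minimal sketch you could run on its own. It is not Nutch code: the class name CharsetCheck and the helpers decode and roundTrips are made up for illustration, and it assumes a standard JDK where the windows-1250, windows-1252 and ISO-8859-2 charsets are available. It shows that the Polish diacritics survive a round trip through both Polish code pages, and that the single byte 0xA3 is Ł in windows-1250 and ISO-8859-2 but a pound sign in windows-1252, so a page labelled windows-1252 (as in your first message) cannot carry that letter at all.

```java
import java.nio.charset.Charset;

// Illustrative only: names are hypothetical, nothing here is taken from Nutch.
class CharsetCheck {

    // Decode raw bytes using the charset a parser believes the page is in.
    static String decode(byte[] raw, String charsetName) {
        return new String(raw, Charset.forName(charsetName));
    }

    // Encode to the given charset and decode back; true if nothing was lost.
    static boolean roundTrips(String text, String charsetName) {
        Charset cs = Charset.forName(charsetName);
        return text.equals(new String(text.getBytes(cs), cs));
    }

    public static void main(String[] args) {
        // The diacritics from the thread, written as escapes so the result
        // does not depend on the encoding of this source file.
        String diacritics = "\u0119\u00F3\u0142\u0105\u015B\u0107\u017C\u0144";

        for (String name : new String[] {"windows-1250", "ISO-8859-2", "UTF-8"}) {
            System.out.println(name + " round trip ok: " + roundTrips(diacritics, name));
        }

        // The same byte, interpreted under three different declared charsets.
        byte[] raw = {(byte) 0xA3};
        for (String name : new String[] {"windows-1250", "ISO-8859-2", "windows-1252"}) {
            String s = decode(raw, name);
            System.out.println("0xA3 as " + name + " -> U+"
                    + Integer.toHexString(s.codePointAt(0)).toUpperCase());
        }

        // windows-1252 has no Polish L with stroke, so encoding it loses the letter.
        System.out.println("windows-1252 can hold U+0141: "
                + roundTrips("\u0141", "windows-1252"));
    }
}
```

If the round trips succeed on your machine as well, the place to look is how the charset name is chosen for each page (HTTP header versus META tag), not the conversion to Unicode.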
Dawid

On Wed, Sep 23, 2009 at 3:09 PM, MilleBii <mille...@gmail.com> wrote:
> At last someone answers.
> Correct CP1250.
> My pages look fine in the browsers of course, but it does not mean Nutch
> handles them properly.
>
> What I'm wondering is if the nutch HTML parser reads them properly,
> because when I do a search on such characters it fails on pages iso8859-2 or
> cp1250, but not if the page is UTF-8 encoded from what I could see.
> Nutch uses java String (ie Unicode) internally, but I wonder if there would
> be a problem in the conversion from the page encoding into the unicode
> encoding.
>
> I did not have time to dig into the details of the matter, I wonder if any
> one has come across the issue and/or solved it.
>
> 2009/9/23 Dawid Weiss <dawid.we...@gmail.com>
>
>> Polish Web sites use Cp1250 (windows-1250) or iso8859-2 (or UTF-8 of
>> course). Check if diacritics like these:
>>
>> ęółąśćżń
>>
>> look all right in the above encodings and use appropriately.
>>
>> Dawid
>>
>> On Wed, Sep 16, 2009 at 4:47 PM, MilleBii <mille...@gmail.com> wrote:
>> > same thing when there is
>> > charset=ISO-8859-2
>> >
>> > 2009/9/16 MilleBii <mille...@gmail.com>
>> >
>> >> Not sure where to look for explanations:
>> >>
>> >> I have a problem with some Polish pages which I can not index properly on
>> >> the specific polish characters such as :
>> >> Ł
>> >>
>> >> They are having the following charset=windows-1252
>> >>
>> >> Does the HTML parser convert them into their Unicode equivalent ....
>> >>
>> >> --
>> >> -MilleBii-
>> >
>> > --
>> > -MilleBii-
>
> --
> -MilleBii-