Santiago Pérez wrote:
Yes, I tried setting the Latin encoding Windows-1250 in that configuration
file, but the value of this property does not affect the encoding of the
content (I also tried a nonexistent encoding and the result is the
same...)

<property>
  <name>parser.character.encoding.default</name>
  <value>Windows-1250</value>
  <description>The character encoding to fall back to when no other
  information is available</description>
</property>

Has anyone had the same problem? (Hungarian or Polish people surely have...)

The garbled characters that you quoted in your other email suggest that the problem may be the opposite: your pages seem to use UTF-8, and you are trying to decode them as Windows-1250. Try putting UTF-8 in this property, and see what happens.
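
To illustrate the kind of garbling this mismatch produces, here is a minimal Python sketch (not Nutch code). The sample string and the function names `misdecode` and `repair` are made up for the illustration; `cp1250` is Python's codec name for Windows-1250.

```python
# Illustration only: what happens when UTF-8 bytes are decoded as Windows-1250.
# The sample text and helper names are hypothetical, not taken from Nutch.

SAMPLE = "zażółć gęślą jaźń"  # Polish pangram fragment with many diacritics


def misdecode(text: str) -> str:
    """Encode text as UTF-8, then wrongly decode the bytes as Windows-1250."""
    return text.encode("utf-8").decode("cp1250")


def repair(garbled: str) -> str:
    """Undo the mistake: recover the original bytes, then decode them as UTF-8."""
    return garbled.encode("cp1250").decode("utf-8")


garbled = misdecode(SAMPLE)
print(garbled)          # each accented letter shows up as two unrelated characters
print(repair(garbled))  # the original text again
```

If the characters in your output look like the first printed line, the pages are most likely UTF-8 that is being read as Windows-1250, not the other way round.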

Generally speaking, pages should declare their encoding, either in HTTP headers or in <meta> tags, but often this declaration is either missing or completely wrong. Nutch uses the ICU4J CharsetDetector plus its own heuristics (in util.EncodingDetector and in HtmlParser) that try to detect the character encoding when the declaration is missing or even wrong, but this is a tricky issue and the results are sometimes unpredictable.
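
As a rough sketch of the general idea (this is not Nutch's actual algorithm, and it does no statistical detection the way ICU4J does), the Python below tries the HTTP header first, then a <meta> tag, then a configured default, and skips any candidate that is unknown or that cannot decode the bytes. The function name `guess_encoding` and its parameters are assumptions made for the example.

```python
import codecs
import re

# Hypothetical helper, not part of Nutch: pick an encoding from declarations,
# falling back to a default when a declaration is missing or clearly wrong.
HEADER_CHARSET = re.compile(r'charset\s*=\s*["\']?([A-Za-z0-9_.:-]+)', re.IGNORECASE)
META_CHARSET = re.compile(rb'<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9_.:-]+)',
                          re.IGNORECASE)


def guess_encoding(content_type, body, default="utf-8"):
    """Return the first candidate codec that exists and decodes body, else None."""
    candidates = []
    if content_type:
        match = HEADER_CHARSET.search(content_type)
        if match:
            candidates.append(match.group(1))
    match = META_CHARSET.search(body[:4096])  # declarations sit near the top
    if match:
        candidates.append(match.group(1).decode("ascii"))
    candidates.append(default)

    for name in candidates:
        try:
            codec = codecs.lookup(name).name  # unknown names raise LookupError
            body.decode(codec)                # wrong guesses often fail here
        except (LookupError, UnicodeDecodeError):
            continue
        return codec
    return None
```

Note the limitation: a single-byte encoding such as Windows-1250 accepts almost any byte sequence, so a wrong declaration of that kind passes this check without complaint. That is one reason real detectors rely on statistics and why the results can still be unpredictable.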


--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
