Edward Ackroyd created NUTCH-1530:
-------------------------------------

             Summary: Umlauts (üäö) garbled when fetch and parse in separate 
calls (OK when fetcher.parse is true)
                 Key: NUTCH-1530
                 URL: https://issues.apache.org/jira/browse/NUTCH-1530
             Project: Nutch
          Issue Type: Bug
          Components: parser
    Affects Versions: 2.1
         Environment: Using Cassandra-1.2.1 as data store.
            Reporter: Edward Ackroyd


When crawling http://www.spiegel.de (a popular German news site) with separate 
fetch and parse calls (nutch fetch, then nutch parse, with fetcher.parse=false), 
the following lands in Cassandra (all umlauts garbled, for example '�' instead 
of 'ö'):

[default@webpage] list p;
RowKey: de.spiegel.www:http/
=> (column=c, value=SPIEGEL ONLINE - Nachrichten Schlagzeilen Hilfe RSS 
Newsletter Mobil Wetter TV-Programm Dienstag, 12. Februar 2013 SPIEGEL ONLINE 
NACHRICHTEN Home Politik Deutschland Ausland   Wirtschaft B�rse Verbraucher & 
Service Unternehmen & M�rkte Staat & Soziales Jobsuche Immowelt   Panorama 
Justiz Leute Gesellschaft Partnersuche Eurojackpot Tarifvergleiche   Sport 
Wintersport Fu�ball Bundesliga...

However, when fetcher.parse=true and the fetch call does the parsing, the 
correct umlauts land in Cassandra:

[default@webpage] list p;
RowKey: de.spiegel.www:http/
=> (column=c, value=SPIEGEL ONLINE - Nachrichten Schlagzeilen Hilfe RSS 
Newsletter Mobil Wetter TV-Programm Dienstag, 12. Februar 2013 SPIEGEL ONLINE 
NACHRICHTEN Home Politik Deutschland Ausland   Wirtschaft Börse Verbraucher & 
Service Unternehmen & Märkte Staat & Soziales Jobsuche Immowelt   Panorama 
Justiz Leute Gesellschaft Partnersuche Eurojackpot Tarifvergleiche   Sport 
Wintersport Fußball Bundesliga...


It seems the content is over-encoded when fetching and parsing are done in 
separate calls.
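
To help narrow this down, here is a minimal Java sketch of two common ways 
umlauts get garbled. It is not taken from the Nutch code base, and whether 
either path matches what Nutch 2.1 does between fetch and parse is unconfirmed. 
The class and method names (UmlautDecodingSketch, misdecode, doubleEncode) are 
made up for illustration. The first method decodes ISO-8859-1 bytes as UTF-8, 
which yields the replacement character '�' (U+FFFD) as seen in the Cassandra 
output above. The second decodes UTF-8 bytes as ISO-8859-1, which yields 
two-character sequences such as 'Ã¶' instead.

```java
import java.nio.charset.StandardCharsets;

public class UmlautDecodingSketch {

    // Hypothetical helper: the text is serialized as ISO-8859-1 bytes,
    // then those bytes are wrongly decoded as UTF-8. A lone 0xF6 byte is
    // not valid UTF-8, so the decoder substitutes U+FFFD for it.
    static String misdecode(String text) {
        byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1);
        return new String(latin1, StandardCharsets.UTF_8);
    }

    // Hypothetical helper: the text is serialized as UTF-8 bytes, then
    // those bytes are wrongly decoded as ISO-8859-1. Each byte becomes
    // one character, so a two-byte umlaut turns into two characters.
    static String doubleEncode(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // Unicode escapes keep the source independent of file encoding:
        // \u00F6 is 'ö'.
        String original = "B\u00F6rse";
        System.out.println(misdecode(original));    // 'ö' replaced by U+FFFD
        System.out.println(doubleEncode(original)); // 'ö' replaced by U+00C3 U+00B6
    }
}
```

Since the stored content shows a single '�' per umlaut rather than a 
two-character sequence, the first path (bytes decoded with the wrong charset) 
looks like the closer match, but that is only a guess from the output.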

