[ https://issues.apache.org/jira/browse/NUTCH-1530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Lewis John McGibbney updated NUTCH-1530:
----------------------------------------
Fix Version/s: 2.2
> Umlauts (üäö) garbled when fetch and parse in separate calls (OK when
> fetcher.parse is true)
> --------------------------------------------------------------------------------------------
>
> Key: NUTCH-1530
> URL: https://issues.apache.org/jira/browse/NUTCH-1530
> Project: Nutch
> Issue Type: Bug
> Components: parser
> Affects Versions: 2.1
> Environment: Using Cassandra-1.2.1 as data store.
> Reporter: Edward Ackroyd
> Fix For: 2.2
>
>
> When crawling http://www.spiegel.de (a popular German news site) with separate
> fetch and parse calls (nutch fetch, then nutch parse, with fetcher.parse=false),
> the following lands in Cassandra (all umlauts garbled, for example '�' instead
> of 'ö'):
> [default@webpage] list p;
> RowKey: de.spiegel.www:http/
> => (column=c, value=SPIEGEL ONLINE - Nachrichten Schlagzeilen Hilfe RSS
> Newsletter Mobil Wetter TV-Programm Dienstag, 12. Februar 2013 SPIEGEL ONLINE
> NACHRICHTEN Home Politik Deutschland Ausland Wirtschaft B�rse Verbraucher
> & Service Unternehmen & M�rkte Staat & Soziales Jobsuche Immowelt
> Panorama Justiz Leute Gesellschaft Partnersuche Eurojackpot Tarifvergleiche
> Sport Wintersport Fu�ball Bundesliga...
> However, when fetcher.parse=true and the fetch call does the parsing, the
> correct umlauts land in Cassandra:
> [default@webpage] list p;
> RowKey: de.spiegel.www:http/
> => (column=c, value=SPIEGEL ONLINE - Nachrichten Schlagzeilen Hilfe RSS
> Newsletter Mobil Wetter TV-Programm Dienstag, 12. Februar 2013 SPIEGEL ONLINE
> NACHRICHTEN Home Politik Deutschland Ausland Wirtschaft Börse Verbraucher &
> Service Unternehmen & Märkte Staat & Soziales Jobsuche Immowelt Panorama
> Justiz Leute Gesellschaft Partnersuche Eurojackpot Tarifvergleiche Sport
> Wintersport Fußball Bundesliga...
> It seems that the content is over-encoded when fetching and parsing run as
> separate calls.
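
One way output like the above can arise is a charset mismatch between the two steps. This is an assumption for illustration, not a confirmed diagnosis of this issue. The minimal sketch below assumes the page bytes arrive as ISO-8859-1 and are later decoded as UTF-8. The class CharsetMismatchDemo and its decode helper are hypothetical and are not part of Nutch.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not Nutch code.
public class CharsetMismatchDemo {

    // Decode raw page bytes with the given charset, as a parse step might.
    static String decode(byte[] raw, Charset charset) {
        return new String(raw, charset);
    }

    public static void main(String[] args) {
        // Assumption: the server sends ISO-8859-1, where 'ö' is the single byte 0xF6.
        // \u00F6 is 'ö'; the escape keeps this source file ASCII-only.
        byte[] raw = "B\u00F6rse".getBytes(StandardCharsets.ISO_8859_1);

        // Decoding with the charset the bytes were written in restores the umlaut.
        String right = decode(raw, StandardCharsets.ISO_8859_1);

        // 0xF6 is not valid UTF-8, so the decoder substitutes U+FFFD for it.
        String wrong = decode(raw, StandardCharsets.UTF_8);

        System.out.println(right.equals("B\u00F6rse")); // true
        System.out.println(wrong.equals("B\uFFFDrse")); // true
    }
}
```

If this is what happens, the charset detected while fetching would have to be available to the separate parse call, or the garbling shown in the first listing would be the expected result.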
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira