[ https://issues.apache.org/jira/browse/NUTCH-2599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sebastian Nagel reopened NUTCH-2599:
------------------------------------

Sorry, [~gbouchar]. I missed that only the Tika parser is affected. Using the parse-html plugin worked.

> charset detection issue with parse-tika
> ---------------------------------------
>
>                 Key: NUTCH-2599
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2599
>             Project: Nutch
>          Issue Type: Bug
>          Components: parser
>         Environment: {code:java}
> plugin.includes: protocol-http|parse-tika{code}
>            Reporter: Gerard Bouchar
>            Priority: Major
>
> Here is an example page that is displayed correctly in web browsers, but is
> decoded with the wrong charset in Nutch:
> [https://gerardbouchar.github.io/html-encoding-example/index.html]
>
> This page's contents are encoded in UTF-8 and it is served with HTTP headers
> indicating that it is in UTF-8, but it contains a bogus HTML meta tag
> indicating that it is encoded in ISO-8859-1.
>
> This is a tricky case, but there is a [WHATWG HTML specification about how to handle
> it|https://html.spec.whatwg.org/multipage/parsing.html#determining-the-character-encoding].
> It clearly states that the HTTP header (transport layer information) should
> have precedence over the HTML meta tag (obtained in [byte stream
> prescanning|https://html.spec.whatwg.org/multipage/parsing.html#prescan-a-byte-stream-to-determine-its-encoding]).
> Browsers do respect the spec, but the Tika parser doesn't.
>
> Looking at [the source
> code|https://github.com/apache/nutch/blob/master/src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/TikaParser.java],
> it looks like the charset information is not even extracted from the HTTP
> headers.
>
> {code:java}
> HTTP/1.1 200 OK
> Content-Type: text/html; charset=utf-8
>
> <!doctype html>
> <html>
> <head>
> <meta charset="iso-8859-1">
> </head>
> <body>
> <a href="/">français</a>
> </body>
> </html>
> {code}

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
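
The precedence rule described in the report (the charset from the HTTP Content-Type header wins over the HTML meta tag) can be illustrated with a minimal Java sketch. This is not the code of parse-tika or of Tika itself: the class `CharsetPrecedenceSketch`, its methods, the regex-based meta lookup, and the UTF-8 fallback are all assumptions made for illustration. The real encoding sniffing algorithm has more steps (for example, BOM sniffing) and a much more careful byte-stream prescan.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch or Tika.
final class CharsetPrecedenceSketch {

  // charset parameter of a Content-Type header value, e.g. "text/html; charset=utf-8"
  private static final Pattern HEADER_CHARSET =
      Pattern.compile("charset\\s*=\\s*\"?([^\\s;\"]+)", Pattern.CASE_INSENSITIVE);

  // Crude stand-in for the byte-stream prescan: first charset=... inside a <meta> tag
  private static final Pattern META_CHARSET =
      Pattern.compile("<meta[^>]*?charset\\s*=\\s*[\"']?([^\\s\"'/>]+)",
          Pattern.CASE_INSENSITIVE);

  // Assumption: arbitrary fallback when neither source names a usable charset
  private static final Charset FALLBACK = StandardCharsets.UTF_8;

  // Resolve a charset label, ignoring unknown or malformed names
  static Optional<Charset> lookup(String name) {
    try {
      return Optional.of(Charset.forName(name));
    } catch (IllegalArgumentException e) {
      return Optional.empty();
    }
  }

  static Optional<Charset> fromContentType(String contentType) {
    if (contentType == null) {
      return Optional.empty();
    }
    Matcher m = HEADER_CHARSET.matcher(contentType);
    return m.find() ? lookup(m.group(1)) : Optional.empty();
  }

  static Optional<Charset> fromMetaPrescan(byte[] content) {
    // Only look at the first 1024 bytes; ISO-8859-1 maps every byte to one char
    int len = Math.min(content.length, 1024);
    String head = new String(content, 0, len, StandardCharsets.ISO_8859_1);
    Matcher m = META_CHARSET.matcher(head);
    return m.find() ? lookup(m.group(1)) : Optional.empty();
  }

  // Transport-layer information first, meta tag second, fallback last
  static Charset chooseCharset(String contentType, byte[] content) {
    Optional<Charset> fromHeader = fromContentType(contentType);
    if (fromHeader.isPresent()) {
      return fromHeader.get();
    }
    return fromMetaPrescan(content).orElse(FALLBACK);
  }

  public static void main(String[] args) {
    byte[] page = "<meta charset=\"iso-8859-1\"><a href=\"/\">fran\u00e7ais</a>"
        .getBytes(StandardCharsets.UTF_8);

    // Header present: UTF-8 is chosen and the text decodes correctly
    Charset withHeader = chooseCharset("text/html; charset=utf-8", page);
    System.out.println(withHeader + ": " + new String(page, withHeader));

    // Header without charset: the bogus meta tag is used and the text is garbled
    Charset withoutHeader = chooseCharset("text/html", page);
    System.out.println(withoutHeader + ": " + new String(page, withoutHeader));
  }
}
```

With the header taken into account, the link text decodes as "français"; when only the meta tag is consulted, the same bytes decode as "franÃ§ais", which is the kind of mis-decoding the report describes.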