[ https://issues.apache.org/jira/browse/NUTCH-2421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16171943#comment-16171943 ]
Sebastian Nagel commented on NUTCH-2421:
----------------------------------------

Shouldn't the parser prioritize the correct charset? With contradictory declarations that's not easy to determine, so maybe it's better to try to guess the charset in this case?

> parse-html to prioritize HTML5 charset definitions
> --------------------------------------------------
>
>                 Key: NUTCH-2421
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2421
>             Project: Nutch
>          Issue Type: Improvement
>          Components: parser
>            Reporter: Laurent Hervaud
>            Priority: Minor
>
> NUTCH-1733 added support for HTML5 charset definitions.
> In some cases a web site declares multiple meta elements with different charsets:
> <meta charset="utf-8">
> <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
> (e.g. http://www.edga.fr/)
> In this case the second charset (iso-8859-1) is detected.
> What about prioritizing HTML5 charset definitions first?
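
To make the proposal concrete, here is a minimal sketch of "HTML5 definition first, http-equiv as fallback". It is not the parse-html code: the class name `CharsetSniffSketch`, the method `pickCharset` and the three regular expressions are hypothetical, and the HTML5 pattern assumes that `charset` is the first attribute of the `<meta>` element.

```java
import java.util.Locale;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch, not the actual Nutch implementation.
class CharsetSniffSketch {

  // HTML5 form: <meta charset="utf-8"> (assumes charset is the first attribute)
  private static final Pattern HTML5_META = Pattern.compile(
      "<meta\\s+charset\\s*=\\s*[\"']?([a-z][_\\-0-9a-z]*)",
      Pattern.CASE_INSENSITIVE);

  // Legacy form: <meta http-equiv="Content-Type" content="...; charset=...">
  private static final Pattern HTTP_EQUIV_META = Pattern.compile(
      "<meta\\s+[^>]*http-equiv\\s*=\\s*[\"']?content-type[\"']?[^>]*>",
      Pattern.CASE_INSENSITIVE);

  // charset parameter inside the legacy element
  private static final Pattern CHARSET_PARAM = Pattern.compile(
      "charset\\s*=\\s*[\"']?([a-z][_\\-0-9a-z]*)",
      Pattern.CASE_INSENSITIVE);

  /** Returns the HTML5 charset if one is declared, else the http-equiv one. */
  static Optional<String> pickCharset(String head) {
    Matcher html5 = HTML5_META.matcher(head);
    if (html5.find()) {
      return Optional.of(html5.group(1).toLowerCase(Locale.ROOT));
    }
    Matcher legacy = HTTP_EQUIV_META.matcher(head);
    if (legacy.find()) {
      Matcher param = CHARSET_PARAM.matcher(legacy.group());
      if (param.find()) {
        return Optional.of(param.group(1).toLowerCase(Locale.ROOT));
      }
    }
    return Optional.empty();
  }

  public static void main(String[] args) {
    // The two contradictory declarations quoted in the issue description.
    String head = "<meta charset=\"utf-8\">\n"
        + "<meta http-equiv=\"Content-Type\" "
        + "content=\"text/html; charset=iso-8859-1\">";
    System.out.println(pickCharset(head).orElse("(none)"));
  }
}
```

Note that this only fixes the order of preference. It does not check which declaration matches the actual bytes of the page, so with contradictory declarations a detection step, as suggested in the comment above, may still be needed.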