[
https://issues.apache.org/jira/browse/NUTCH-2421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16171943#comment-16171943
]
Sebastian Nagel commented on NUTCH-2421:
----------------------------------------
Shouldn't the parser prioritize the correct charset? With contradictory
declarations that's not easy. Maybe it's better to try to guess the charset in
this case?
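
For illustration, here is a minimal sketch of the prioritization proposed in
the issue: look for the HTML5 <meta charset> form first and fall back to the
http-equiv form only if the former is absent. The class name, method name and
regular expressions are hypothetical and are not taken from the parse-html
plugin. The regexes are deliberately simplified: they assume that charset is
the first attribute of the HTML5 meta element and that http-equiv precedes
content in the legacy one.

```java
import java.util.Locale;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration only, not Nutch code. */
class MetaCharsetSniffer {

  // HTML5 form: <meta charset="utf-8">
  // Simplification: charset must be the first attribute.
  private static final Pattern HTML5_CHARSET = Pattern.compile(
      "<meta\\s+charset\\s*=\\s*[\"']?([A-Za-z0-9._:-]+)",
      Pattern.CASE_INSENSITIVE);

  // Legacy form:
  // <meta http-equiv="Content-Type" content="text/html; charset=...">
  // Simplification: http-equiv must come before content.
  private static final Pattern HTTP_EQUIV_CHARSET = Pattern.compile(
      "<meta\\s+[^>]*http-equiv\\s*=\\s*[\"']?content-type[\"']?"
          + "[^>]*charset\\s*=\\s*[\"']?([A-Za-z0-9._:-]+)",
      Pattern.CASE_INSENSITIVE);

  /**
   * Returns the declared charset (lower-cased), preferring the HTML5
   * declaration over the http-equiv one regardless of their order.
   */
  static Optional<String> sniff(String head) {
    Matcher m = HTML5_CHARSET.matcher(head);
    if (m.find()) {
      return Optional.of(m.group(1).toLowerCase(Locale.ROOT));
    }
    m = HTTP_EQUIV_CHARSET.matcher(head);
    if (m.find()) {
      return Optional.of(m.group(1).toLowerCase(Locale.ROOT));
    }
    // No declaration found: the caller would have to guess the charset.
    return Optional.empty();
  }
}
```

With the two declarations quoted below, sniff() would return utf-8 rather than
iso-8859-1. Whether utf-8 is actually the correct charset for such a page is a
separate question, which is why guessing may still be needed.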
> parse-html to prioritize HTML5 charset definitions
> --------------------------------------------------
>
> Key: NUTCH-2421
> URL: https://issues.apache.org/jira/browse/NUTCH-2421
> Project: Nutch
> Issue Type: Improvement
> Components: parser
> Reporter: Laurent Hervaud
> Priority: Minor
>
> NUTCH-1733 added support for HTML5 charset definitions.
> In some cases a web site declares multiple meta elements with different charsets:
> <meta charset="utf-8">
> <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
> (e.g. http://www.edga.fr/)
> In this case the second charset (iso-8859-1) is detected.
> What about prioritizing the HTML5 charset definition?
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)