Laurent Hervaud created NUTCH-2421:
--------------------------------------

             Summary: parse-html to prioritize HTML5 charset definitions
                 Key: NUTCH-2421
                 URL: https://issues.apache.org/jira/browse/NUTCH-2421
             Project: Nutch
          Issue Type: Improvement
          Components: parser
            Reporter: Laurent Hervaud
            Priority: Minor


NUTCH-1733 added support for HTML5 charset definitions.
In some cases a web site declares multiple meta elements with different charsets:
    <meta charset="utf-8">
    <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1"> 
(e.g. http://www.edga.fr/)
In this case the second charset (iso-8859-1) is the one detected.
Could parse-html give priority to the HTML5 charset definition instead?
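
To make the proposal concrete, here is a minimal sketch of "HTML5 first" selection. It is not the parse-html code: the class name Html5CharsetFirst, the method sniffCharset, and the regex-based scanning are all assumptions made for illustration. It also simplifies by treating any meta tag without an http-equiv attribute as an HTML5 charset declaration.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: picks a charset from meta tags,
// preferring <meta charset="..."> over <meta http-equiv="Content-Type" ...>.
final class Html5CharsetFirst {

  // Matches a whole <meta ...> tag.
  private static final Pattern META_TAG =
      Pattern.compile("<meta\\b[^>]*>", Pattern.CASE_INSENSITIVE);

  // Matches "charset=<name>" either as an attribute or inside a content value.
  private static final Pattern CHARSET =
      Pattern.compile("charset\\s*=\\s*[\"']?([A-Za-z0-9_.:-]+)",
          Pattern.CASE_INSENSITIVE);

  // Returns the lower-cased charset name, or null if no meta tag declares one.
  static String sniffCharset(String head) {
    String legacy = null;
    Matcher tags = META_TAG.matcher(head);
    while (tags.find()) {
      String tag = tags.group();
      Matcher cs = CHARSET.matcher(tag);
      if (!cs.find()) {
        continue; // this meta tag carries no charset information
      }
      String name = cs.group(1).toLowerCase(Locale.ROOT);
      // Simplification (assumption): no http-equiv attribute means HTML5 form.
      boolean isHttpEquiv = tag.toLowerCase(Locale.ROOT).contains("http-equiv");
      if (!isHttpEquiv) {
        return name; // HTML5 definition wins regardless of position
      }
      if (legacy == null) {
        legacy = name; // remember the first http-equiv charset as fallback
      }
    }
    return legacy;
  }

  public static void main(String[] args) {
    String head = "<meta charset=\"utf-8\">\n"
        + "<meta http-equiv=\"Content-Type\" "
        + "content=\"text/html; charset=iso-8859-1\">";
    System.out.println(sniffCharset(head));
  }
}
```

With this ordering, the http-equiv value is used only when no HTML5 definition is present, so the result no longer depends on which meta element appears last.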




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
