[ https://issues.apache.org/jira/browse/NUTCH-1733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13943802#comment-13943802 ]
Hudson commented on NUTCH-1733:
-------------------------------

SUCCESS: Integrated in Nutch-trunk #2572 (See [https://builds.apache.org/job/Nutch-trunk/2572/])
NUTCH-1733 parse-html to support HTML5 charset definitions (snagel: http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1580046)
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java
* /nutch/trunk/src/plugin/parse-html/src/test/org/apache/nutch/parse/html/TestHtmlParser.java


> parse-html to support HTML5 charset definitions
> -----------------------------------------------
>
>                 Key: NUTCH-1733
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1733
>             Project: Nutch
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.8, 2.2.1
>            Reporter: Sebastian Nagel
>             Fix For: 2.3, 1.9
>
>         Attachments: NUTCH-1733-2.x.patch, NUTCH-1733-trunk.patch,
>                      charset_bom_html5.html, charset_bom_utf16_html5.html, charset_html5.html
>
>
> HTML5 allows specifying the character encoding of a page via
> * {{<meta charset="...">}}
> * Unicode Byte Order Mark (BOM)
> These are allowed in addition to the previous HTTP/http-equiv Content-Type, see
> [[1|http://www.w3.org/TR/2011/WD-html5-diff-20110405/#character-encoding]].
> Parse-html ignores both meta charset and BOM and falls back to the default
> encoding (cp1252). Parse-tika sets the encoding appropriately.
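For readers unfamiliar with the two HTML5 mechanisms named in the issue, the following is a minimal, self-contained sketch of how a parser might look for a BOM and a meta charset declaration before falling back to a default. It is not the code from the committed patch to HtmlParser.java. The class name CharsetSniffer, its method names, and the 1024-byte scan limit are all hypothetical, and checking the BOM before the meta tag is an assumption about precedence rather than something stated in the issue.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative only: hypothetical helper, not part of Nutch. */
public class CharsetSniffer {

  // Assumed limit on how many leading bytes are scanned for a <meta> tag.
  private static final int SNIFF_LIMIT = 1024;

  // Matches a charset value inside a <meta ...> tag. This covers both the
  // HTML5 form <meta charset="..."> and the older http-equiv form, where
  // "charset=" appears inside the content attribute.
  private static final Pattern META_CHARSET = Pattern.compile(
      "<meta\\s+[^>]*?charset\\s*=\\s*[\"']?([a-z0-9_:.-]+)",
      Pattern.CASE_INSENSITIVE);

  private static boolean startsWith(byte[] data, int... prefix) {
    if (data.length < prefix.length) {
      return false;
    }
    for (int i = 0; i < prefix.length; i++) {
      if ((data[i] & 0xFF) != prefix[i]) {
        return false;
      }
    }
    return true;
  }

  /** Returns the encoding implied by a leading BOM, or null if none is found. */
  public static String detectFromBom(byte[] data) {
    if (startsWith(data, 0xEF, 0xBB, 0xBF)) {
      return "UTF-8";
    }
    if (startsWith(data, 0xFE, 0xFF)) {
      return "UTF-16BE";
    }
    // Simplification: a UTF-32LE BOM (FF FE 00 00) would also match here.
    if (startsWith(data, 0xFF, 0xFE)) {
      return "UTF-16LE";
    }
    return null;
  }

  /** Returns the charset named in a <meta> tag near the start, or null. */
  public static String detectFromMeta(byte[] data) {
    int len = Math.min(data.length, SNIFF_LIMIT);
    // ISO-8859-1 maps every byte to one char, so ASCII markup survives decoding.
    String head = new String(data, 0, len, StandardCharsets.ISO_8859_1);
    Matcher m = META_CHARSET.matcher(head);
    return m.find() ? m.group(1) : null;
  }

  /** Tries the BOM first, then <meta>, then returns the caller's fallback. */
  public static String detect(byte[] data, String fallback) {
    String enc = detectFromBom(data);
    if (enc == null) {
      enc = detectFromMeta(data);
    }
    return enc != null ? enc : fallback;
  }

  public static void main(String[] args) {
    byte[] page = "<html><head><meta charset=\"utf-8\"></head></html>"
        .getBytes(StandardCharsets.ISO_8859_1);
    // "windows-1252" is the Java name for the cp1252 default cited in the issue.
    System.out.println(detect(page, "windows-1252"));
  }
}
```

The fallback is passed in by the caller rather than hard-coded, so the same sketch can be tried with the cp1252 default mentioned in the issue or with any other default.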