[ https://issues.apache.org/jira/browse/TIKA-1837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tim Allison resolved TIKA-1837.
-------------------------------
Resolution: Fixed
Sorry, this one fell off my plate. Thank you for opening this!
> HtmlEncodingDetector wrongly detects charset from commented meta
> ----------------------------------------------------------------
>
> Key: TIKA-1837
> URL: https://issues.apache.org/jira/browse/TIKA-1837
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.11
> Environment: Any.
> Reporter: Pascal Essiembre
> Priority: Minor
> Labels: easyfix, patch
> Original Estimate: 5m
> Remaining Estimate: 5m
>
> The org.apache.tika.parser.html.HtmlEncodingDetector class grabs the
> first meta tag containing a charset that matches the pattern defined in
> HTTP_META_PATTERN. The problem arises when there are multiple such
> meta tags and the first ones are commented out.
> In my view, the detector should ignore commented-out markup for this
> detection.
> Real example encountered in an HTML page:
> {code:xml}
> <!--<meta http-equiv="Content-Type" content="text/html;
> charset=ISO-8859-1"> -->
> <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
> {code}
> The detector currently detects {{ISO-8859-1}} while it should detect
> {{utf-8}}.
> *Fix:*
> Rather than modifying the meta-detection regex, I recommend first stripping
> comments, keeping in mind that the substring read from the input stream may
> not contain the closing characters {{-->}}. This has been tested to work:
> {code:title=HtmlEncodingDetector.java, line 104+|borderStyle=solid}
> String head = ASCII.decode(ByteBuffer.wrap(buffer, 0, n)).toString();
> // START FIX:
> head = head.replaceAll("<!--.*?(-->|$)", "");
> // END FIX
> Matcher equiv = HTTP_META_PATTERN.matcher(head);
> {code}
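
For readers who want to try the approach outside Tika, here is a minimal standalone sketch of the comment-stripping step. It makes several assumptions: the class name, the method names, and META_CHARSET are hypothetical, and META_CHARSET is only a stand-in for Tika's HTTP_META_PATTERN, which may differ. The sketch also adds the (?s) flag to the quoted pattern. In Java, "." does not match line terminators by default, so if the line break in the quoted example is really present in the page, the pattern as quoted would leave that comment in place.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CommentStrippingSketch {

    // Hypothetical stand-in for Tika's HTTP_META_PATTERN; the real pattern may differ.
    private static final Pattern META_CHARSET = Pattern.compile(
            "(?is)<meta\\s+[^<>]*charset\\s*=\\s*[\"']?([a-z0-9_:.-]+)");

    // Quoted pattern plus (?s), so '.' also crosses line terminators.
    // The "|$" branch removes a comment that is still open at the end of the buffer.
    private static final Pattern COMMENT = Pattern.compile("(?s)<!--.*?(-->|$)");

    /** Removes HTML comments, including a trailing unterminated one. */
    public static String stripComments(String head) {
        return COMMENT.matcher(head).replaceAll("");
    }

    /** Returns the charset named by the first matching meta tag, or null if none. */
    public static String firstCharset(String head) {
        Matcher m = META_CHARSET.matcher(head);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // The example from the report, with the line break inside the comment.
        String head = "<!--<meta http-equiv=\"Content-Type\" content=\"text/html;\n"
                + "charset=ISO-8859-1\"> -->\n"
                + "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\" />";

        System.out.println(firstCharset(head));                // ISO-8859-1
        System.out.println(firstCharset(stripComments(head))); // utf-8
    }
}
```

Stripping comments before matching leaves the meta regex untouched, which is the trade-off the report argues for. The cost is one extra pass over the decoded head buffer.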