Pascal Essiembre created TIKA-1837:
--------------------------------------
Summary: HtmlEncodingDetector wrongly detects charset from commented meta
Key: TIKA-1837
URL: https://issues.apache.org/jira/browse/TIKA-1837
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 1.11
Environment: Any.
Reporter: Pascal Essiembre
Priority: Minor
The org.apache.tika.parser.html.HtmlEncodingDetector class uses the first
meta tag whose charset declaration matches the pattern defined in
HTTP_META_PATTERN. The problem arises when there are several such meta
tags and the first ones are commented out.
In my opinion, the detector should ignore commented-out markup for this detection.
Real example encountered in an HTML page:
{code:xml}
<!--<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
-->
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
{code}
The detector currently detects {{ISO-8859-1}} while it should detect {{utf-8}}.
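The behaviour can be reproduced outside Tika with a small sketch. The pattern below is a simplified stand-in for HTTP_META_PATTERN (an assumption for illustration, not Tika's actual regex), and the class and method names are hypothetical. It only shows that a "first match wins" search sees the commented-out tag first:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CommentedMetaRepro {
    // Simplified stand-in for HTTP_META_PATTERN (assumption, not Tika's actual regex).
    static final Pattern META_CHARSET = Pattern.compile(
            "(?is)<meta\\s+http-equiv\\s*=\\s*['\"]Content-Type['\"]"
            + "\\s+content\\s*=\\s*['\"][^'\"]*charset=([^'\";\\s]+)");

    // Returns the charset of the first matching meta tag, or null if none matches.
    static String firstCharset(String head) {
        Matcher m = META_CHARSET.matcher(head);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Same markup as the example in this report.
        String head =
                "<!--<meta http-equiv=\"Content-Type\" content=\"text/html; charset=ISO-8859-1\">\n"
                + "-->\n"
                + "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\" />";
        // The search does not know about comments, so the commented-out tag wins.
        System.out.println(firstCharset(head)); // prints ISO-8859-1
    }
}
```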
*Fix:*
Rather than modifying the meta-detection regex, I recommend first stripping
comments, taking into account that the substring read from the input stream may
not contain the closing characters {{-->}}. This has been tested to work:
{code:title=HtmlEncodingDetector.java, line 104+|borderStyle=solid}
String head = ASCII.decode(ByteBuffer.wrap(buffer, 0, n)).toString();
// START FIX:
// (?s) makes '.' match line breaks, so comments spanning several lines are stripped too.
head = head.replaceAll("(?s)<!--.*?(-->|$)", "");
// END FIX
Matcher equiv = HTTP_META_PATTERN.matcher(head);
{code}
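As a standalone check of the comment-stripping step, here is a minimal sketch that can be run without Tika. The class and method names are hypothetical. It uses the {{(?s)}} (DOTALL) flag so that comments spanning several lines are covered, and it exercises both a closed comment and one that is cut off at the end of the buffered head:

```java
import java.util.regex.Pattern;

public class CommentStripSketch {
    // (?s) = DOTALL, so '.' also matches line breaks. The "$" alternative covers a
    // comment that is still open when the buffered head ends.
    static final Pattern COMMENT = Pattern.compile("(?s)<!--.*?(-->|$)");

    // Removes HTML comments, including a trailing unterminated one.
    static String stripComments(String head) {
        return COMMENT.matcher(head).replaceAll("");
    }

    public static void main(String[] args) {
        String utf8Meta =
                "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\" />";

        // Closed comment spanning two lines, as in the example from this report.
        String closed =
                "<!--<meta http-equiv=\"Content-Type\" content=\"text/html; charset=ISO-8859-1\">\n"
                + "-->\n" + utf8Meta;

        // Comment whose closing "-->" lies beyond the bytes that were read.
        String truncated = utf8Meta + "<!-- <meta http-equiv=\"Content-Type\" content=\"text/html; charset=ISO";

        // In both cases only the uncommented utf-8 tag remains.
        System.out.println(stripComments(closed).trim());
        System.out.println(stripComments(truncated));
    }
}
```

Stripping up to the end of the input when no {{-->}} is found is a deliberate choice here: anything after an unterminated {{<!--}} in the buffer is treated as comment content and therefore ignored by the detection.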