[jira] [Commented] (TIKA-1837) HtmlEncodingDetector wrongly detects charset from commented meta

Hudson (JIRA) Thu, 26 May 2016 08:17:47 -0700

    [ 
https://issues.apache.org/jira/browse/TIKA-1837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15302220#comment-15302220
 ]


Hudson commented on TIKA-1837:
------------------------------

FAILURE: Integrated in tika-2.x-windows #7 (See 
[https://builds.apache.org/job/tika-2.x-windows/7/])
TIKA-1837 -- strip comments before trying to find encoding in (tallison: rev 
43e30006d16ef4957c7aff347c9b76f3e1e163c5)
* 
tika-parser-modules/tika-parser-web-module/src/main/java/org/apache/tika/parser/html/HtmlEncodingDetector.java
* 
tika-parser-modules/tika-parser-web-module/src/test/java/org/apache/tika/parser/html/HtmlParserTest.java


> HtmlEncodingDetector wrongly detects charset from commented meta
> ----------------------------------------------------------------
>
>                 Key: TIKA-1837
>                 URL: https://issues.apache.org/jira/browse/TIKA-1837
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.11
>         Environment: Any.
>            Reporter: Pascal Essiembre
>            Priority: Minor
>              Labels: easyfix, patch
>   Original Estimate: 5m
>  Remaining Estimate: 5m
>
> The org.apache.tika.parser.html.HtmlEncodingDetector class will grab the 
> first meta tag that has a charset in it matching the pattern defined in 
> HTTP_META_PATTERN. The problem encountered is when there are multiple such 
> meta tags but the first ones are commented.  
> In my mind the detector should not consider commented code for this 
> detection. 
> Real example encountered in an HTML page:
> {code:xml}
>    <!--<meta http-equiv="Content-Type" content="text/html; 
> charset=ISO-8859-1"> -->
>    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
> {code}
> The detector currently detects {{ISO-8859-1}} while it should detect 
> {{utf-8}}.
> *Fix:*
> As opposed to modify the meta-detection regex, I recommend to first strip 
> comments, taking into consideration the substring from the input stream may 
> not hold the closing characters {{-->}}.  This has been tested to work:
> {code:title=HtmlEncodingDetector.java, line 104+|borderStyle=solid}
>         String head = ASCII.decode(ByteBuffer.wrap(buffer, 0, n)).toString();
>         // START FIX:
>         head = head.replaceAll("<!--.*?(-->|$)", "");
>         // END FIX
>         Matcher equiv = HTTP_META_PATTERN.matcher(head);
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (TIKA-1837) HtmlEncodingDetector wrongly detects charset from commented meta

Reply via email to