[jira] [Resolved] (TIKA-1837) HtmlEncodingDetector wrongly detects charset from commented meta

Tim Allison (JIRA) Thu, 26 May 2016 07:51:49 -0700

     [ https://issues.apache.org/jira/browse/TIKA-1837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Tim Allison resolved TIKA-1837.
-------------------------------
    Resolution: Fixed

Sorry, this one fell off my plate.  Thank you for opening this!

> HtmlEncodingDetector wrongly detects charset from commented meta
> ----------------------------------------------------------------
>
>                 Key: TIKA-1837
>                 URL: https://issues.apache.org/jira/browse/TIKA-1837
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.11
>         Environment: Any.
>            Reporter: Pascal Essiembre
>            Priority: Minor
>              Labels: easyfix, patch
>   Original Estimate: 5m
>  Remaining Estimate: 5m
>
> The org.apache.tika.parser.html.HtmlEncodingDetector class uses the first 
> meta tag whose charset declaration matches the pattern defined in 
> HTTP_META_PATTERN. The problem arises when there are several such meta 
> tags and the first ones are commented out.  
> The detector should ignore commented-out markup when detecting the 
> charset. 
> Real example encountered in an HTML page:
> {code:xml}
>    <!--<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"> -->
>    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
> {code}
> The detector currently detects {{ISO-8859-1}} while it should detect 
> {{utf-8}}.
> *Fix:*
> Rather than modifying the meta-detection regex, I recommend stripping 
> comments first, keeping in mind that the substring read from the input 
> stream may not contain the closing characters {{-->}}.  This has been 
> tested to work:
> {code:title=HtmlEncodingDetector.java, line 104+|borderStyle=solid}
>         String head = ASCII.decode(ByteBuffer.wrap(buffer, 0, n)).toString();
>         // START FIX:
>         head = head.replaceAll("<!--.*?(-->|$)", "");
>         // END FIX
>         Matcher equiv = HTTP_META_PATTERN.matcher(head);
> {code}
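
The comment-stripping idea from the report can be sketched as a standalone Java class. This is only an illustration, not the code committed to Tika: the class name, the method names, and the simplified META_CHARSET pattern are hypothetical stand-ins (the real HTTP_META_PATTERN is not quoted in this message). One detail worth checking in any variant of the fix: in Java, `.` does not match line terminators unless DOTALL is set, so the sketch enables that flag to cover comments that span several lines, and it uses `\z` to cover a comment whose closing `-->` falls outside the buffered prefix.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; names here are illustrative and not part of Tika.
class CommentAwareCharsetSketch {

    // DOTALL lets '.' cross line breaks, so multi-line comments are removed too.
    // The "\\z" alternative handles a comment that is still open at the end of
    // the buffered prefix (its closing "-->" was never read).
    private static final Pattern COMMENT =
            Pattern.compile("<!--.*?(?:-->|\\z)", Pattern.DOTALL);

    // Assumption: a simplified stand-in for Tika's HTTP_META_PATTERN.
    private static final Pattern META_CHARSET = Pattern.compile(
            "(?is)<meta\\s+[^<>]*?charset\\s*=\\s*[\"']?([a-z0-9_:.-]+)");

    // Removes HTML comments, including an unterminated trailing one.
    static String stripComments(String head) {
        return COMMENT.matcher(head).replaceAll("");
    }

    // Returns the first charset declared in a non-commented meta tag, or null.
    static String detectCharset(String head) {
        Matcher m = META_CHARSET.matcher(stripComments(head));
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String html =
                "<!--<meta http-equiv=\"Content-Type\" content=\"text/html; charset=ISO-8859-1\"> -->\n"
              + "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\" />";
        System.out.println(detectCharset(html));
    }
}
```

With the example from the report as input, the commented-out ISO-8859-1 declaration is discarded before matching, so only the utf-8 declaration remains visible to the charset pattern.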



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
