[ 
https://issues.apache.org/jira/browse/TIKA-1837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15108609#comment-15108609
 ] 

Tim Allison commented on TIKA-1837:
-----------------------------------

Thank you for raising this. I'll fix this next week unless someone else takes 
it first.

Out of curiosity, how often are you seeing this? Can you describe your corpus? 
How are you detecting botched encodings?

> HtmlEncodingDetector wrongly detects charset from commented meta
> ----------------------------------------------------------------
>
>                 Key: TIKA-1837
>                 URL: https://issues.apache.org/jira/browse/TIKA-1837
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.11
>         Environment: Any.
>            Reporter: Pascal Essiembre
>            Priority: Minor
>              Labels: easyfix, patch
>   Original Estimate: 5m
>  Remaining Estimate: 5m
>
> The org.apache.tika.parser.html.HtmlEncodingDetector class uses the first 
> meta tag whose charset matches the pattern defined in HTTP_META_PATTERN. 
> The problem arises when there are several such meta tags and the first ones 
> are commented out. 
> In my view, the detector should ignore commented-out markup when detecting 
> the charset. 
> Real example encountered in an HTML page:
> {code:xml}
>    <!--<meta http-equiv="Content-Type" content="text/html; 
> charset=ISO-8859-1"> -->
>    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
> {code}
> The detector currently detects {{ISO-8859-1}} while it should detect 
> {{utf-8}}.
> *Fix:*
> Rather than modifying the meta-detection regex, I recommend first stripping 
> comments, taking into account that the substring read from the input stream 
> may not contain the closing characters {{-->}}.  This has been tested to work:
> {code:title=HtmlEncodingDetector.java, line 104+|borderStyle=solid}
>         String head = ASCII.decode(ByteBuffer.wrap(buffer, 0, n)).toString();
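>         // START FIX: strip comments. (?s) lets '.' span line breaks, and '$'
>         // covers a comment whose closing --> lies beyond the buffer.
>         head = head.replaceAll("(?s)<!--.*?(-->|$)", "");
>         // END FIX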
>         Matcher equiv = HTTP_META_PATTERN.matcher(head);
> {code}
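
The comment-stripping idea in the quoted description can be tried in isolation with the sketch below. It is not Tika's code: the class name CommentStripSketch, both method names, and the META_CHARSET pattern are hypothetical stand-ins (the real HTTP_META_PATTERN is not shown in this thread). The comment pattern uses the DOTALL flag (?s) so that a comment spanning several lines is removed as well, and the `$` alternative handles a comment that is cut off at the end of the buffer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CommentStripSketch {
    // Hypothetical, simplified stand-in for Tika's HTTP_META_PATTERN (assumption).
    static final Pattern META_CHARSET = Pattern.compile(
            "(?is)<meta\\s[^>]*?charset\\s*=\\s*[\"']?([a-z0-9_-]+)");

    // (?s): '.' also matches line breaks; '$' matches an unterminated comment
    // that runs to the end of the buffered head.
    static final Pattern COMMENT = Pattern.compile("(?s)<!--.*?(-->|$)");

    // Removes HTML comments, including one truncated by the end of the input.
    static String stripComments(String head) {
        return COMMENT.matcher(head).replaceAll("");
    }

    // Returns the charset named by the first matching meta tag, or null.
    static String firstCharset(String head) {
        Matcher m = META_CHARSET.matcher(head);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String head =
                "<!--<meta http-equiv=\"Content-Type\" content=\"text/html;\n"
              + "charset=ISO-8859-1\"> -->\n"
              + "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\" />";
        System.out.println(firstCharset(head));                // ISO-8859-1
        System.out.println(firstCharset(stripComments(head))); // utf-8
    }
}
```

Stripping before matching keeps the meta pattern itself unchanged, which is the trade-off the reporter argues for. A detector that must also ignore meta tags inside script or style blocks would need more than this regex.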



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
