[ 
https://issues.apache.org/jira/browse/TIKA-1837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15108892#comment-15108892
 ] 

Pascal Essiembre commented on TIKA-1837:
----------------------------------------

How often? It was the first and only time I have encountered this as an issue. 
:-)

We publish an open-source crawler (Norconex HTTP Collector) and we rely on Tika 
for parsing the vast majority of documents.  This specific issue has been 
raised by one of the HTTP Collector users (see the external issue URL).  As 
such, we do not have a set corpus: it's everybody's data.


> HtmlEncodingDetector wrongly detects charset from commented meta
> ----------------------------------------------------------------
>
>                 Key: TIKA-1837
>                 URL: https://issues.apache.org/jira/browse/TIKA-1837
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.11
>         Environment: Any.
>            Reporter: Pascal Essiembre
>            Priority: Minor
>              Labels: easyfix, patch
>   Original Estimate: 5m
>  Remaining Estimate: 5m
>
> The org.apache.tika.parser.html.HtmlEncodingDetector class grabs the 
> first meta tag whose charset matches the pattern defined in 
> HTTP_META_PATTERN. The problem arises when there are multiple such 
> meta tags and the first ones are commented out.  
> In my view, the detector should ignore commented-out markup for this 
> detection. 
> Real example encountered in an HTML page:
> {code:xml}
>    <!--<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"> -->
>    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
> {code}
> The detector currently detects {{ISO-8859-1}} while it should detect 
> {{utf-8}}.
> *Fix:*
> Rather than modifying the meta-detection regex, I recommend first stripping 
> comments, taking into account that the substring read from the input stream 
> may not contain the closing characters {{-->}}.  This has been tested to work:
> {code:title=HtmlEncodingDetector.java, line 104+|borderStyle=solid}
>         String head = ASCII.decode(ByteBuffer.wrap(buffer, 0, n)).toString();
>         // START FIX:
>         head = head.replaceAll("<!--.*?(-->|$)", "");
>         // END FIX
>         Matcher equiv = HTTP_META_PATTERN.matcher(head);
> {code}
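
For anyone who wants to try the idea outside of Tika, here is a minimal standalone sketch of the comment-stripping step. Several things in it are assumptions and not taken from the Tika source: the class name CommentStripSketch, the helper names, and the META_CHARSET pattern, which is only a simplified stand-in for HTTP_META_PATTERN. The sketch also adds the (?s) flag, which is not part of the patch quoted above, so that a comment spanning a line break is removed as well.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of Tika.
final class CommentStripSketch {

    // Simplified stand-in for Tika's HTTP_META_PATTERN (assumption, not the real pattern).
    private static final Pattern META_CHARSET = Pattern.compile(
            "(?is)<meta\\s[^<>]*?charset\\s*=\\s*[\"']?([\\w-]+)");

    // (?s) lets '.' cross line breaks; the '$' alternative covers a comment
    // whose closing "-->" lies beyond the end of the buffered head.
    private static final Pattern COMMENT = Pattern.compile("(?s)<!--.*?(-->|$)");

    // Removes HTML comments, including one cut off by the end of the input.
    static String stripComments(String head) {
        return COMMENT.matcher(head).replaceAll("");
    }

    // Returns the charset of the first matching meta tag, or null if none.
    static String firstCharset(String head) {
        Matcher m = META_CHARSET.matcher(head);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String head =
                "<!--<meta http-equiv=\"Content-Type\" content=\"text/html;\n"
              + "charset=ISO-8859-1\"> -->\n"
              + "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\" />";

        System.out.println(firstCharset(head));                // prints ISO-8859-1
        System.out.println(firstCharset(stripComments(head))); // prints utf-8
    }
}
```

Stripping comments before matching keeps the meta-detection regex itself untouched, which is the trade-off proposed in the issue description.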



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
