Pascal Essiembre created TIKA-1837:
--------------------------------------

             Summary: HtmlEncodingDetector wrongly detects charset from commented meta
                 Key: TIKA-1837
                 URL: https://issues.apache.org/jira/browse/TIKA-1837
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 1.11
         Environment: Any.
            Reporter: Pascal Essiembre
            Priority: Minor


The org.apache.tika.parser.html.HtmlEncodingDetector class grabs the first 
meta tag whose charset matches the pattern defined in HTTP_META_PATTERN. 
The problem arises when there are several such meta tags and the first ones 
are commented out.

In my view, the detector should ignore commented-out markup for this detection. 

Real example encountered in an HTML page:

{code:xml}
   <!--<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"> -->
   <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
{code}

The detector currently detects {{ISO-8859-1}} while it should detect {{utf-8}}.
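
The behaviour can be reproduced outside Tika with a small sketch. Note that {{META_CHARSET}} below is a simplified stand-in I wrote for illustration, not the actual {{HTTP_META_PATTERN}}, and the class and method names are hypothetical. It only shows that a plain "first match wins" regex search has no notion of comments:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CommentedMetaDemo {
    // Simplified stand-in for HTTP_META_PATTERN (an assumption, not Tika's actual regex).
    static final Pattern META_CHARSET = Pattern.compile(
            "(?is)<meta\\s+http-equiv\\s*=\\s*['\"]\\s*Content-Type['\"]"
            + "\\s+content\\s*=\\s*['\"][^'\"]*charset\\s*=\\s*([^'\";\\s]+)");

    // Returns the charset of the first matching meta tag, or null if none matches.
    static String firstCharset(String head) {
        Matcher m = META_CHARSET.matcher(head);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // The two lines from the HTML page quoted above.
        String head =
                "<!--<meta http-equiv=\"Content-Type\" content=\"text/html; charset=ISO-8859-1\"> -->\n"
                + "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\" />";
        // The search is unaware of the comment, so the commented tag is found first.
        System.out.println(firstCharset(head));
    }
}
```

Because the matcher scans the raw text from the start, the commented-out tag is reached before the live one.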

*Fix:*

Rather than modifying the meta-detection regex, I recommend stripping 
comments first, taking into account that the substring read from the input 
stream may not contain the closing characters {{-->}}.  This has been tested to work:

{code:title=HtmlEncodingDetector.java, line 104+|borderStyle=solid}
        String head = ASCII.decode(ByteBuffer.wrap(buffer, 0, n)).toString();

        // START FIX:
        head = head.replaceAll("<!--.*?(-->|$)", "");
        // END FIX

        Matcher equiv = HTTP_META_PATTERN.matcher(head);
{code}
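
For anyone wanting to try the idea in isolation, here is a minimal standalone sketch of the same "strip comments, then match" approach. It is not the Tika code: {{META_CHARSET}} is a simplified stand-in for {{HTTP_META_PATTERN}}, and the class and method names are hypothetical. It also adds one thing that is not in the snippet above, the {{(?s)}} flag. Without it, {{.}} does not match line terminators in Java, so a comment spanning several lines would not be removed. Whether that variant is wanted in the detector is an open question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CommentStripSketch {
    // Simplified stand-in for HTTP_META_PATTERN (an assumption, not Tika's actual regex).
    static final Pattern META_CHARSET = Pattern.compile(
            "(?is)<meta\\s+http-equiv\\s*=\\s*['\"]\\s*Content-Type['\"]"
            + "\\s+content\\s*=\\s*['\"][^'\"]*charset\\s*=\\s*([^'\";\\s]+)");

    // (?s) lets '.' cross line breaks so multi-line comments are matched too.
    // The "|$" alternative removes a comment whose closing "-->" lies beyond
    // the part of the stream that was buffered.
    static final Pattern COMMENT = Pattern.compile("(?s)<!--.*?(-->|$)");

    // Removes HTML comments, including an unterminated trailing one.
    static String stripComments(String head) {
        return COMMENT.matcher(head).replaceAll("");
    }

    // Returns the charset of the first meta tag outside comments, or null.
    static String detectCharset(String head) {
        Matcher m = META_CHARSET.matcher(stripComments(head));
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String head =
                "<!--<meta http-equiv=\"Content-Type\" content=\"text/html; charset=ISO-8859-1\"> -->\n"
                + "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\" />";
        System.out.println(detectCharset(head));
    }
}
```

Stripping before matching keeps the meta regex untouched, at the cost of one extra pass over the buffered head.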



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
