Hans Brende created ANY23-385:
---------------------------------

             Summary: Improve charset detection for (x)html documents
                 Key: ANY23-385
                 URL: https://issues.apache.org/jira/browse/ANY23-385
             Project: Apache Any23
          Issue Type: Improvement
          Components: encoding
    Affects Versions: 2.3
            Reporter: Hans Brende
            Assignee: Hans Brende
             Fix For: 2.3


When attempting to detect a document's encoding, our {{TikaEncodingDetector}} 
does not take into account the following charset declarations, which may occur 
in HTML/XHTML documents:

HTML:
{{<meta http-equiv="content-type" content="text/html; charset=xyz"/>}}

HTML5: 
{{<meta charset="xyz">}}

XHTML:
{{<?xml version='1.0' encoding='xyz'?>}}
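
To illustrate what honoring the declarations listed above could look like, here is a minimal sketch. The class {{DeclaredCharsetSniffer}} and its method are hypothetical names, not part of Any23 or Tika, and the regular expressions are a simplification of real HTML/XML parsing. It assumes an ASCII-compatible encoding, so it would not find a declaration in a UTF-16 document.

```java
import java.nio.charset.StandardCharsets;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration; not part of Any23 or Tika. */
final class DeclaredCharsetSniffer {

    // Matches the encoding pseudo-attribute of an XML declaration.
    private static final Pattern XML_DECL = Pattern.compile(
            "<\\?xml\\s[^>]*?encoding\\s*=\\s*[\"']([^\"']+)[\"']",
            Pattern.CASE_INSENSITIVE);

    // Matches both <meta charset="xyz"> and the charset=xyz parameter inside
    // <meta http-equiv="content-type" content="text/html; charset=xyz"/>.
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta\\s[^>]*?charset\\s*=\\s*[\"']?([^\\s\"'/>;]+)",
            Pattern.CASE_INSENSITIVE);

    /** Returns the charset name declared in the given leading bytes, if any. */
    static Optional<String> findDeclaredCharset(byte[] head) {
        // ISO-8859-1 maps every byte to exactly one char, so ASCII markup
        // survives decoding even when the real encoding is still unknown.
        String text = new String(head, StandardCharsets.ISO_8859_1);
        Matcher xml = XML_DECL.matcher(text);
        if (xml.find()) {
            return Optional.of(xml.group(1));
        }
        Matcher meta = META_CHARSET.matcher(text);
        if (meta.find()) {
            return Optional.of(meta.group(1));
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        String html = "<html><head><meta charset=\"utf-8\"></head></html>";
        System.out.println(
                findDeclaredCharset(html.getBytes(StandardCharsets.US_ASCII)));
        // prints Optional[utf-8]
    }
}
```

A declared charset found this way could be used as a hint for the statistical detector rather than a replacement for it, since pages sometimes declare an encoding they do not actually use.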

In addition, the {{TikaEncodingDetector}} only sniffs the first 12000 bytes of 
the document. If, for example, the first non-ASCII (multi-byte UTF-8) character 
occurs later than that, the detector may misidentify the encoding as ISO-8859-1 
or Windows-1252 instead of UTF-8, even if UTF-8 is specified in the page's meta 
charset element.

I have seen this problem occur with, e.g., the webpage 
http://losangeles.eventful.com/events/september (where the first UTF-8 encoded 
characters occurred far past the 12000 byte mark in JSON-LD content towards the 
bottom of the page, causing certain JSON-LD strings to come out looking like 
gibberish).
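
The failure mode can be reproduced without Tika. The sketch below is an illustration only: {{SniffWindowDemo}} is a hypothetical class, the page is synthetic, and a plain "any non-ASCII byte?" check stands in for the real statistical detector. It builds a UTF-8 page whose first 12000 bytes are pure ASCII, then shows that the sniffed window carries no evidence of UTF-8 and that decoding the whole page as ISO-8859-1 garbles the JSON-LD string.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical demo for illustration; does not use Tika or Any23. */
final class SniffWindowDemo {

    // Sniff window size quoted in this issue.
    static final int SNIFF_LIMIT = 12000;

    /** Builds a UTF-8 page whose only non-ASCII character sits past the window. */
    static byte[] buildPage() {
        StringBuilder sb = new StringBuilder(
                "<html><head><meta charset=\"UTF-8\"></head><body>");
        // ASCII filler: one char is one byte, so length() equals the byte count.
        while (sb.length() < SNIFF_LIMIT) {
            sb.append("<p>plain ascii filler</p>");
        }
        sb.append("<script type=\"application/ld+json\">")
          .append("{\"name\":\"Caf\u00e9\"}")
          .append("</script></body></html>");
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    /** True if none of the first {@code limit} bytes has the high bit set. */
    static boolean isAsciiOnly(byte[] bytes, int limit) {
        int n = Math.min(limit, bytes.length);
        for (int i = 0; i < n; i++) {
            if (bytes[i] < 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] page = buildPage();
        // The sniffed window looks identical in UTF-8 and ISO-8859-1.
        System.out.println(isAsciiOnly(page, SNIFF_LIMIT)); // prints true
        // The full page does contain multi-byte UTF-8 sequences.
        System.out.println(isAsciiOnly(page, page.length)); // prints false
        // Decoding with the wrong charset turns each UTF-8 byte into its own char.
        String wrong = new String(page, StandardCharsets.ISO_8859_1);
        String right = new String(page, StandardCharsets.UTF_8);
        System.out.println(wrong.equals(right)); // prints false
    }
}
```

In this synthetic page the declared charset in the meta element is the only signal available within the sniffed window, which is why taking it into account would help.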




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
