Hans Brende created ANY23-385:
---------------------------------
Summary: Improve charset detection for (x)html documents
Key: ANY23-385
URL: https://issues.apache.org/jira/browse/ANY23-385
Project: Apache Any23
Issue Type: Improvement
Components: encoding
Affects Versions: 2.3
Reporter: Hans Brende
Assignee: Hans Brende
Fix For: 2.3
When attempting to detect a document's encoding, our {{TikaEncodingDetector}}
does not take into account the following charset declarations, which may occur
in HTML/XHTML documents:
HTML:
{{<meta http-equiv="content-type" content="text/html; charset=xyz"/>}}
HTML5:
{{<meta charset="xyz">}}
XHTML:
{{<?xml version="1.0" encoding="xyz"?>}}
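For illustration, here is a minimal sketch of how the three declarations above could be picked out of the leading bytes of a document. The class name {{DeclaredCharsetSniffer}}, the method {{findDeclaredCharset}}, and both regular expressions are hypothetical and are not part of Any23 or Tika. The sketch also assumes an ASCII-compatible encoding, so it would not find declarations in UTF-16 or UTF-32 documents.

```java
import java.nio.charset.StandardCharsets;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not an Any23 or Tika class.
class DeclaredCharsetSniffer {

    // Matches the encoding pseudo-attribute of an XML declaration.
    private static final Pattern XML_DECL = Pattern.compile(
            "<\\?xml\\b[^>]*?\\bencoding\\s*=\\s*[\"']([^\"']+)[\"']",
            Pattern.CASE_INSENSITIVE);

    // Matches both <meta charset="xyz"> and
    // <meta http-equiv="content-type" content="text/html; charset=xyz">.
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta\\b[^>]*?\\bcharset\\s*=\\s*[\"']?\\s*([A-Za-z0-9._:-]+)",
            Pattern.CASE_INSENSITIVE);

    /** Returns the charset name declared in the given bytes, if any. */
    public static Optional<String> findDeclaredCharset(byte[] head) {
        // ISO-8859-1 maps every byte to exactly one char, so ASCII markup
        // stays readable whatever the document's real encoding is.
        String text = new String(head, StandardCharsets.ISO_8859_1);

        Matcher xml = XML_DECL.matcher(text);
        if (xml.find()) {
            return Optional.of(xml.group(1));
        }
        Matcher meta = META_CHARSET.matcher(text);
        if (meta.find()) {
            return Optional.of(meta.group(1));
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        String[] samples = {
            "<meta http-equiv=\"content-type\" content=\"text/html; charset=xyz\"/>",
            "<meta charset=\"xyz\">",
            "<?xml version=\"1.0\" encoding='xyz'?>",
            "<html><head><title>no declaration</title></head></html>"
        };
        for (String sample : samples) {
            byte[] bytes = sample.getBytes(StandardCharsets.US_ASCII);
            System.out.println(findDeclaredCharset(bytes));
        }
    }
}
```

A declared name found this way would still need to be validated (for example with {{Charset.isSupported}}) and weighed against the statistical guess before being trusted.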
In addition, the {{TikaEncodingDetector}} only sniffs the first 12000 bytes of
the document, meaning that if, for example, the first non-ASCII character of a
UTF-8 encoded document occurs later than that, the detector may misidentify the
encoding as ISO-8859-1 or Windows-1252 instead of UTF-8 (even if UTF-8 is
specified in the page's meta charset element).
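The following sketch reproduces the situation in miniature. It builds a UTF-8 page whose first 12000 bytes are plain ASCII, then checks what a detector limited to that window could see. The class {{PrefixSniffDemo}}, the constant {{SNIFF_LIMIT}}, and the sample page content are all made up for illustration; no Tika code is involved.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Any23 or Tika.
class PrefixSniffDemo {

    // The 12000-byte window described in this issue (constant name is made up).
    public static final int SNIFF_LIMIT = 12000;

    /** Returns true if none of the first 'limit' bytes is >= 0x80. */
    public static boolean prefixIsPureAscii(byte[] data, int limit) {
        int end = Math.min(limit, data.length);
        for (int i = 0; i < end; i++) {
            if (data[i] < 0) { // Java bytes are signed: negative means >= 0x80
                return false;
            }
        }
        return true;
    }

    /** Builds a UTF-8 page whose only non-ASCII character lies past the window. */
    public static byte[] buildPage() {
        StringBuilder sb = new StringBuilder(
                "<html><head><meta charset=\"utf-8\"></head><body>");
        while (sb.length() < SNIFF_LIMIT) {
            sb.append("<p>plain ascii filler</p>");
        }
        sb.append("<script type=\"application/ld+json\">")
          .append("{\"name\":\"Caf\u00e9\"}")
          .append("</script></body></html>");
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] page = buildPage();

        // Within the window, UTF-8, ISO-8859-1 and Windows-1252 look identical.
        System.out.println("window is pure ASCII: "
                + prefixIsPureAscii(page, SNIFF_LIMIT));
        System.out.println("whole page is pure ASCII: "
                + prefixIsPureAscii(page, page.length));

        // Decoding with the wrong guess turns the two UTF-8 bytes of U+00E9
        // into two separate characters (U+00C3 followed by U+00A9).
        String wrong = new String(page, StandardCharsets.ISO_8859_1);
        String right = new String(page, StandardCharsets.UTF_8);
        System.out.println("mojibake present: " + wrong.contains("Caf\u00c3\u00a9"));
        System.out.println("correct text present: " + right.contains("Caf\u00e9"));
    }
}
```

Because the bytes inside the window are valid in all three encodings, byte statistics alone cannot settle the question there; the meta charset declaration is the only evidence available that early in the document.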
I have seen this problem occur with, e.g., the webpage
http://losangeles.eventful.com/events/september (where the first non-ASCII
characters occurred far past the 12000-byte mark in JSON-LD content towards the
bottom of the page, causing certain JSON-LD strings to come out looking like
gibberish).