[
https://issues.apache.org/jira/browse/ANY23-385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16569597#comment-16569597
]
ASF GitHub Bot commented on ANY23-385:
--------------------------------------
GitHub user asfgit closed the pull request at:
https://github.com/apache/any23/pull/115
> Improve charset detection for (x)html documents
> -----------------------------------------------
>
> Key: ANY23-385
> URL: https://issues.apache.org/jira/browse/ANY23-385
> Project: Apache Any23
> Issue Type: Improvement
> Components: encoding
> Affects Versions: 2.3
> Reporter: Hans Brende
> Assignee: Hans Brende
> Priority: Major
> Fix For: 2.3
>
>
> When attempting to detect a document's encoding, our {{TikaEncodingDetector}}
> does not take into account the following elements which may occur in
> html/xhtml documents:
> HTML:
> {{<meta http-equiv="content-type" content="text/html; charset=xyz"/>}}
> HTML5:
> {{<meta charset="xyz">}}
> XHTML:
> {{<?xml encoding='xyz'?>}}
> In addition, the {{TikaEncodingDetector}} only sniffs the first 12000 bytes
> of the document. If, for example, the first non-ASCII (multi-byte UTF-8)
> character occurs later than that, the detector may misidentify the encoding
> as ISO-8859-1 or Windows-1252 instead of UTF-8, even if UTF-8 is specified
> in the meta charset element of the page.
> The webpage
> http://losangeles.eventful.com/events/september
> is one example where I have seen this problem occur. The UTF-8 charset was
> properly specified at the top of the page, but the first multi-byte UTF-8
> characters occurred far past the 12000-byte mark, in JSON-LD content towards
> the bottom of the page. The {{TikaEncodingDetector}} therefore misidentified
> the encoding as ISO-8859-1, and certain JSON-LD strings came out looking
> like gibberish.
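
The idea in the description above can be illustrated with a minimal sketch: look for a declared charset in the leading bytes before falling back to statistical detection. This is not the code merged in the pull request. The class `DeclaredCharsetSniffer`, its methods, the regular expressions, and the choice to check the XML declaration before the meta elements are all assumptions made for illustration. It also assumes an ASCII-compatible encoding, so the markup itself is readable when the bytes are decoded as ISO-8859-1.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Any23 or Tika.
final class DeclaredCharsetSniffer {

    // XHTML: encoding pseudo-attribute of the XML declaration.
    private static final Pattern XML_DECL = Pattern.compile(
            "<\\?xml[^>]*?\\bencoding\\s*=\\s*[\"']([^\"']+)[\"']",
            Pattern.CASE_INSENSITIVE);

    // HTML and HTML5: "charset=..." anywhere inside a <meta> tag. This covers
    // both <meta charset="xyz"> and the http-equiv form, where the charset
    // sits inside the content attribute.
    private static final Pattern META = Pattern.compile(
            "<meta\\b[^>]*?\\bcharset\\s*=\\s*[\"']?\\s*([^\\s\"'/>;]+)",
            Pattern.CASE_INSENSITIVE);

    private DeclaredCharsetSniffer() { }

    // Returns the first declared charset that this JVM supports, if any.
    static Optional<Charset> sniff(byte[] head) {
        // ISO-8859-1 maps every byte to exactly one char, so decoding never
        // fails and ASCII markup is preserved as-is.
        String text = new String(head, StandardCharsets.ISO_8859_1);
        for (Pattern pattern : new Pattern[] {XML_DECL, META}) {
            Matcher m = pattern.matcher(text);
            if (m.find()) {
                try {
                    return Optional.of(Charset.forName(m.group(1).trim()));
                } catch (IllegalArgumentException e) {
                    // Malformed or unsupported charset name: try the next pattern.
                }
            }
        }
        return Optional.empty();
    }

    // Decodes the whole document with the charset declared in its first
    // sniffLimit bytes, or with the fallback if nothing usable is declared.
    static String decode(byte[] document, int sniffLimit, Charset fallback) {
        byte[] head = Arrays.copyOf(document, Math.min(document.length, sniffLimit));
        return new String(document, sniff(head).orElse(fallback));
    }
}
```

With something like this in place, a caller could pass the result of a statistical detector as the `fallback` argument, so that an explicit declaration near the top of the page wins even when the first multi-byte characters appear beyond the sniffed region.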