[ 
https://issues.apache.org/jira/browse/ANY23-385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hans Brende resolved ANY23-385.
-------------------------------
    Resolution: Fixed

> Improve charset detection for (x)html documents
> -----------------------------------------------
>
>                 Key: ANY23-385
>                 URL: https://issues.apache.org/jira/browse/ANY23-385
>             Project: Apache Any23
>          Issue Type: Improvement
>          Components: encoding
>    Affects Versions: 2.3
>            Reporter: Hans Brende
>            Assignee: Hans Brende
>            Priority: Major
>             Fix For: 2.3
>
>
> When attempting to detect a document's encoding, our {{TikaEncodingDetector}} 
> does not take into account the following elements which may occur in 
> html/xhtml documents:
> HTML:
> {{<meta http-equiv="content-type" content="text/html; charset=xyz"/>}}
> HTML5: 
> {{<meta charset="xyz">}}
> XHTML:
> {{<?xml encoding='xyz'?>}}
> In addition, the {{TikaEncodingDetector}} only sniffs the first 12000 bytes 
> of the document. If the first non-ASCII character occurs later than that, 
> the detector may misidentify the encoding as ISO-8859-1 or Windows-1252 
> instead of UTF-8, even if UTF-8 is specified in the meta charset element of 
> the page.
> I have seen this problem occur with, e.g., the webpage 
> http://losangeles.eventful.com/events/september where the UTF-8 charset was 
> properly specified at the top of the page, but the first non-ASCII 
> characters occurred far past the 12000 byte mark, in JSON-LD content towards 
> the bottom of the page. As a result, the TikaEncodingDetector misidentified 
> the encoding as ISO-8859-1, and certain JSON-LD strings came out looking 
> like gibberish.
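
For illustration only, the kind of declared-charset lookup described above can be sketched as follows. This is not the change that resolved ANY23-385, and it is not part of Any23 or Tika: the class name DeclaredCharsetSniffer and its sniff method are hypothetical, and the regular expressions are a simplification (the HTML standard defines a more involved prescan algorithm). The sketch assumes the markup at the head of the document is ASCII-compatible.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration; not an Any23 or Tika class. */
final class DeclaredCharsetSniffer {

    // XML declaration, e.g. <?xml version="1.0" encoding='xyz'?>
    private static final Pattern XML_DECL = Pattern.compile(
            "<\\?xml[^>]*?\\bencoding\\s*=\\s*[\"']([^\"']+)[\"']",
            Pattern.CASE_INSENSITIVE);

    // Covers both <meta charset="xyz"> and
    // <meta http-equiv="content-type" content="text/html; charset=xyz"/>
    private static final Pattern META = Pattern.compile(
            "<meta\\b[^>]*?\\bcharset\\s*=\\s*[\"']?([\\w.:-]+)",
            Pattern.CASE_INSENSITIVE);

    /** Returns the charset declared in the given leading bytes, if any. */
    static Optional<Charset> sniff(byte[] head) {
        // ISO-8859-1 maps every byte to one char, so this decode never fails
        // and leaves ASCII markup intact regardless of the real encoding.
        String text = new String(head, StandardCharsets.ISO_8859_1);
        for (Pattern p : new Pattern[] {XML_DECL, META}) {
            Matcher m = p.matcher(text);
            if (m.find()) {
                try {
                    return Optional.of(Charset.forName(m.group(1)));
                } catch (IllegalArgumentException e) {
                    // Unknown or malformed charset name: try the next pattern.
                }
            }
        }
        return Optional.empty();
    }
}
```

A caller could consult such a lookup first and fall back to statistical detection only when it returns an empty result, so that a page declaring UTF-8 at the top is not misread just because its first non-ASCII bytes appear late in the document.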



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
