[ https://issues.apache.org/jira/browse/ANY23-418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hans Brende reassigned ANY23-418:
---------------------------------
Assignee: Hans Brende
> Take another look at encoding detection
> ---------------------------------------
>
> Key: ANY23-418
> URL: https://issues.apache.org/jira/browse/ANY23-418
> Project: Apache Any23
> Issue Type: Improvement
> Components: encoding
> Affects Versions: 2.3
> Reporter: Hans Brende
> Assignee: Hans Brende
> Priority: Major
> Fix For: 2.3
>
>
> To address various shortcomings of Tika's encoding detection, I've had to
> modify the TikaEncodingDetector several times (cf. ANY23-385 and ANY23-411).
> In the former, I placed a much greater weight on charsets declared in HTML
> meta elements and XML declarations. In the latter, I placed a much greater
> weight on charsets returned from HTTP Content-Type headers.
> However, after taking a look at TIKA-539, I'm thinking I should reduce this
> added weight (at least for HTML meta elements), and perhaps ignore it
> altogether unless it happens to match UTF-8, since it seems that incorrect
> declarations usually declare something *other than* UTF-8 when the correct
> charset is UTF-8.
> Something like 90% or more of all webpages use UTF-8 encoding, and all of
> our encoding detection errors to date have involved *something other than
> UTF-8* being detected when the correct encoding was actually UTF-8, not the
> other way around.
> Therefore, what I propose is the following:
> (1) In the absence of a Content-Type header, any declared hints that the
> charset is UTF-8 should add to the weight for UTF-8, while any declared hints
> that the charset is not UTF-8 should be ignored.
> (2) In the presence of a Content-Type header, any other declared hints should
> be ignored, unless they match UTF-8 and do not match the Content-Type header,
> in which case all hints, including the Content-Type header, should be ignored.
> EDIT: The above 2 points are a simplification of what I've actually
> implemented (specifically, I don't necessarily ignore non-UTF-8 hints). See
> PR 131 for details.
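
The two simplified points above can be sketched roughly as follows. This is
only an illustration of rules (1) and (2) as stated in the description, not
the code from PR 131 (which, per the EDIT note, differs in its details). The
class CharsetHintPolicy and its methods are hypothetical names and are not
part of Any23 or Tika. An empty result means "honor no declared hint and rely
on statistical detection alone".

```java
import java.util.List;
import java.util.Locale;
import java.util.Optional;

// Hypothetical helper for illustration only; not an Any23 or Tika class.
final class CharsetHintPolicy {

    private CharsetHintPolicy() {}

    // Assumption: only the spellings "UTF-8" and "UTF8" (any case) count as UTF-8.
    static boolean isUtf8(String charset) {
        if (charset == null) {
            return false;
        }
        String normalized = charset.trim().toUpperCase(Locale.ROOT);
        return normalized.equals("UTF-8") || normalized.equals("UTF8");
    }

    /**
     * Returns the declared charset whose weight should be boosted, if any.
     *
     * @param contentTypeCharset charset from the HTTP Content-Type header, or null if absent
     * @param declaredHints      charsets declared in the document (meta elements, XML declaration)
     */
    static Optional<String> charsetToBoost(String contentTypeCharset, List<String> declaredHints) {
        boolean anyUtf8Hint = declaredHints.stream().anyMatch(CharsetHintPolicy::isUtf8);

        if (contentTypeCharset == null) {
            // Rule (1): without a header, only UTF-8 hints count; others are ignored.
            return anyUtf8Hint ? Optional.of("UTF-8") : Optional.empty();
        }

        // Rule (2): a UTF-8 hint that contradicts the header cancels every hint,
        // including the header itself.
        if (anyUtf8Hint && !isUtf8(contentTypeCharset)) {
            return Optional.empty();
        }

        // Rule (2), default case: the header wins and other hints are ignored.
        return Optional.of(contentTypeCharset.trim());
    }
}
```

For example, under these assumptions a page served with an ISO-8859-1
Content-Type header but carrying a UTF-8 meta declaration yields no boosted
charset at all, so the decision falls back to the detector's own statistics.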