[
https://issues.apache.org/jira/browse/ANY23-418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16679995#comment-16679995
]
ASF GitHub Bot commented on ANY23-418:
--------------------------------------
Github user HansBrende commented on the issue:
https://github.com/apache/any23/pull/131
@lewismc I've added some unit tests that cover the main issues we've been
having with encoding detection.
Unfortunately, the only real way to comprehensively test this is to compare
against millions of webpages "in the wild", but I am confident that it
represents a huge improvement over what we have *now*, based on our past
problems with encoding detection, plus discussions over in Tika regarding the
various issues *they've* been having with encoding detection.
Compare to the original version of this file
[here](https://github.com/apache/any23/blob/bd607c1cc8c63225f9678ec967c73daa474b45aa/encoding/src/main/java/org/apache/any23/encoding/TikaEncodingDetector.java).
Since then, I've made a couple of changes to the algorithm to fix problems
we've encountered along the way, but those tweaks weren't as comprehensive
as this one.
Ideally, I'd like to compare this more comprehensive solution against our
original solution across millions of webpages, but I'm not yet sure how to
proceed in that regard.
> Take another look at encoding detection
> ---------------------------------------
>
> Key: ANY23-418
> URL: https://issues.apache.org/jira/browse/ANY23-418
> Project: Apache Any23
> Issue Type: Improvement
> Components: encoding
> Affects Versions: 2.3
> Reporter: Hans Brende
> Priority: Major
> Fix For: 2.3
>
>
> In order to address various shortcomings of Tika encoding detection, I've had
> to modify the TikaEncodingDetector several times. Cf. ANY23-385 and
> ANY23-411. In the former, I placed a much greater weight on detected charsets
> declared in html meta elements & xml declarations. In the latter, I placed a
> much greater weight on charsets returned from HTTP Content-Type headers.
> However, after taking a look at TIKA-539, I'm thinking I should reduce this
> added weight (for at least html meta elements), and perhaps ignore it
> altogether (unless it happens to match UTF-8, since it seems that incorrect
> declarations usually declare something *other than* UTF-8, when the correct
> charset should be UTF-8).
> Upwards of 90% of all webpages use UTF-8 encoding, and all of our
> encoding detection errors to date have revolved around *something other than
> UTF-8* being detected when the correct encoding was actually UTF-8, not the
> other way around.
> Therefore, what I propose is the following:
> (1) In the absence of a Content-Type header, any declared hints that the
> charset is UTF-8 should add to the weight for UTF-8, while any declared hints
> that the charset is not UTF-8 should be ignored.
> (2) In the presence of a Content-Type header, any other declared hints should
> be ignored, unless they match UTF-8 and do not match the Content-Type header,
> in which case all hints, including the Content-Type header, should be ignored.
> EDIT: The above 2 points are a simplification of what I've actually
> implemented (specifically, I don't necessarily ignore non-UTF-8 hints). See
> PR 131 for details.
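
For illustration only, the two simplified rules (1) and (2) from the issue
description could be sketched roughly as below. This is not the code in PR 131:
as the EDIT note says, the real implementation does not necessarily ignore
non-UTF-8 hints. The class `CharsetHintRules`, the `Outcome` enum, and the
`apply` method are hypothetical names, and the sketch assumes "presence of a
Content-Type header" means the header carried a charset parameter.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Optional;

// Hypothetical sketch of the simplified proposal; not the Any23 implementation.
final class CharsetHintRules {

    // What the detector should do with the declared hints (names are assumptions).
    enum Outcome {
        BOOST_UTF8,        // add weight to UTF-8
        TRUST_HEADER,      // use the Content-Type charset, ignore other hints
        IGNORE_ALL_HINTS,  // fall back to purely statistical detection
        NO_HINT            // nothing usable was declared
    }

    /**
     * @param contentTypeCharset charset from the HTTP Content-Type header, if any
     * @param declaredHints      charsets declared in html meta elements or xml declarations
     */
    static Outcome apply(Optional<Charset> contentTypeCharset, List<Charset> declaredHints) {
        boolean anyUtf8Hint = declaredHints.contains(StandardCharsets.UTF_8);

        if (!contentTypeCharset.isPresent()) {
            // Rule (1): without a header, UTF-8 hints count and non-UTF-8 hints are ignored.
            return anyUtf8Hint ? Outcome.BOOST_UTF8 : Outcome.NO_HINT;
        }

        // Rule (2): with a header, other hints are ignored unless one of them
        // says UTF-8 while the header says something else; then drop everything.
        boolean headerIsUtf8 = StandardCharsets.UTF_8.equals(contentTypeCharset.get());
        if (anyUtf8Hint && !headerIsUtf8) {
            return Outcome.IGNORE_ALL_HINTS;
        }
        return Outcome.TRUST_HEADER;
    }
}
```

The asymmetry is deliberate in this sketch: a non-UTF-8 declaration never
overrides anything, which reflects the observation that incorrect declarations
usually name something other than UTF-8.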
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)