GitHub user HansBrende commented on the issue:
https://github.com/apache/any23/pull/131
@lewismc I've added some additional unit tests that cover the main
issues we've been having with encoding detection.
Unfortunately, the only way to test this comprehensively is to compare
against millions of webpages "in the wild". Still, I am confident that it
represents a huge improvement over what we have *now*, based on our past
problems with encoding detection and on the discussions over in Tika about
the various issues *they've* been having with encoding detection.
Compare to the original version of this file
[here](https://github.com/apache/any23/blob/bd607c1cc8c63225f9678ec967c73daa474b45aa/encoding/src/main/java/org/apache/any23/encoding/TikaEncodingDetector.java).
Since then, I've made a couple of changes to the algorithm to fix
problems we've encountered along the way, but those tweaks weren't as
comprehensive as this change.
Ideally, I'd like to compare this more comprehensive solution against our
original solution across millions of webpages, but I'm not yet sure how to
proceed in that regard.
---