GitHub user HansBrende commented on the issue:
https://github.com/apache/any23/pull/131
@lewismc I've simplified the code considerably, so it should now be much
easier to see what's going on.
Also, I improved the UTF-8 detector by reverse-engineering jchardet's
methodology for UTF-8 detection and writing a UTF-8 state machine that does
the same thing as jchardet in a much more human-readable manner. Along the
way I fixed two bugs in jchardet's UTF-8 detector (possibly a consequence of
how hard the original source code is to read).
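To give a rough idea of what a readable UTF-8 state machine can look like, here is a minimal sketch. It is not the code in this PR and it does not reproduce jchardet's tables: `Utf8Validator` and `isValidUtf8` are hypothetical names, and the sketch only checks strict well-formedness using the standard UTF-8 byte ranges.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper for illustration; not any23's or jchardet's actual class. */
final class Utf8Validator {
    private Utf8Validator() {}

    /** Returns true if the bytes form a well-formed UTF-8 sequence. */
    static boolean isValidUtf8(byte[] bytes) {
        int remaining = 0;           // continuation bytes still expected
        int min = 0x80, max = 0xBF;  // allowed range for the next continuation byte
        for (byte value : bytes) {
            int b = value & 0xFF;
            if (remaining == 0) {
                if (b <= 0x7F) {
                    continue;                               // ASCII
                } else if (b >= 0xC2 && b <= 0xDF) {
                    remaining = 1;                          // 2-byte sequence
                } else if (b == 0xE0) {
                    remaining = 2; min = 0xA0;              // rejects overlong 3-byte forms
                } else if (b == 0xED) {
                    remaining = 2; max = 0x9F;              // rejects UTF-16 surrogates
                } else if (b >= 0xE1 && b <= 0xEF) {
                    remaining = 2;                          // other 3-byte sequences
                } else if (b == 0xF0) {
                    remaining = 3; min = 0x90;              // rejects overlong 4-byte forms
                } else if (b >= 0xF1 && b <= 0xF3) {
                    remaining = 3;                          // other 4-byte sequences
                } else if (b == 0xF4) {
                    remaining = 3; max = 0x8F;              // nothing above U+10FFFF
                } else {
                    return false;                           // 0x80-0xC1 and 0xF5-0xFF never lead
                }
            } else {
                if (b < min || b > max) {
                    return false;                           // bad continuation byte
                }
                min = 0x80; max = 0xBF;                     // later continuations use the full range
                remaining--;
            }
        }
        return remaining == 0;                              // input must not end mid-sequence
    }

    public static void main(String[] args) {
        byte[] utf8 = "h\u00e9llo".getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = "h\u00e9llo".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(isValidUtf8(utf8));
        System.out.println(isValidUtf8(latin1));
    }
}
```

Note that pure ASCII input also passes this check, so a real detector would probably want to track whether any multi-byte sequence was seen before reporting UTF-8 with confidence.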
I started looking into jchardet because, according to
[TIKA-2038](https://issues.apache.org/jira/browse/TIKA-2038), using it to
detect UTF-8 before anything else increased the accuracy of charset detection
from ~72% to ~96%.
Our encoding detector should now be at least as accurate.
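For anyone who wants to see the "UTF-8 before anything else" idea in isolation, here is a small sketch. It is an assumption-laden illustration, not what this PR, Tika, or jchardet actually do: `Utf8FirstDetector` is a hypothetical name, the UTF-8 check uses the JDK's strict `CharsetDecoder` rather than a custom state machine, and a fixed `fallback` charset stands in for whatever statistical detector would normally run second.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper for illustration; not part of any23, Tika, or jchardet. */
final class Utf8FirstDetector {
    private Utf8FirstDetector() {}

    /**
     * Returns UTF-8 if the bytes decode strictly as UTF-8; otherwise returns
     * the fallback, which stands in for the result of a second-stage detector.
     */
    static Charset detect(byte[] bytes, Charset fallback) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)      // fail instead of substituting U+FFFD
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return StandardCharsets.UTF_8;
        } catch (CharacterCodingException e) {
            return fallback;                                         // not well-formed UTF-8
        }
    }
}
```

The ordering matters because well-formed multi-byte UTF-8 is unlikely to occur by accident in text written in a legacy single-byte encoding, so a strict UTF-8 check is a cheap first filter before any statistical guessing.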
Any thoughts on the methodology, as compared to what we had before?
---