[GitHub] any23 issue #131: ANY23-418 improve TikaEncodingDetector

HansBrende Sun, 11 Nov 2018 18:36:19 -0800

Github user HansBrende commented on the issue:

    https://github.com/apache/any23/pull/131
  
    @lewismc I've simplified the code a lot, so it should be much easier to 
see what's going on now.
    
    Also, I improved the UTF-8 detector by reverse engineering jchardet's 
methodology for UTF-8 detection and creating a UTF-8 state machine that does 
the same thing as jchardet in a much more human-readable manner. Along the way 
I fixed two bugs in jchardet's UTF-8 detector, which are possibly due to the 
lack of human-readability in the original source code. 
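    
    For readers unfamiliar with the idea, here is a minimal sketch of what a 
UTF-8 validating state machine can look like. It is not the code from this 
pull request and it does not reproduce jchardet's logic; the class name 
`Utf8Sketch` and the method `isValidUtf8` are hypothetical, and the byte ranges 
are simply the well-formedness rules from RFC 3629.

```java
// Hypothetical illustration only: not the any23 or jchardet implementation.
final class Utf8Sketch {
    private Utf8Sketch() {}

    /** Returns true if the bytes are well-formed UTF-8 per RFC 3629. */
    static boolean isValidUtf8(byte[] bytes) {
        int remaining = 0; // continuation bytes still expected
        int lower = 0x80;  // allowed range for the next continuation byte
        int upper = 0xBF;
        for (byte value : bytes) {
            int b = value & 0xFF;
            if (remaining == 0) {
                if (b <= 0x7F) {
                    continue;                     // plain ASCII
                } else if (b >= 0xC2 && b <= 0xDF) {
                    remaining = 1;                // 2-byte sequence
                } else if (b >= 0xE0 && b <= 0xEF) {
                    remaining = 2;                // 3-byte sequence
                    if (b == 0xE0) lower = 0xA0;  // reject overlong forms
                    if (b == 0xED) upper = 0x9F;  // reject UTF-16 surrogates
                } else if (b >= 0xF0 && b <= 0xF4) {
                    remaining = 3;                // 4-byte sequence
                    if (b == 0xF0) lower = 0x90;  // reject overlong forms
                    if (b == 0xF4) upper = 0x8F;  // reject values above U+10FFFF
                } else {
                    return false;                 // 0x80-0xC1 and 0xF5-0xFF never start a sequence
                }
            } else {
                if (b < lower || b > upper) {
                    return false;                 // not a valid continuation byte
                }
                lower = 0x80;                     // only the first continuation byte is restricted
                upper = 0xBF;
                remaining--;
            }
        }
        return remaining == 0;                    // input must not end mid-sequence
    }
}
```

    A real detector would likely need more than this yes/no answer, for 
example tolerating a multi-byte sequence cut off at the end of a sampled 
buffer, or counting how many multi-byte sequences were seen before trusting 
the result.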
    
    I started looking into jchardet because, according to 
[TIKA-2038](https://issues.apache.org/jira/browse/TIKA-2038), using it to 
detect UTF-8 before anything else increased the accuracy of charset detection 
from ~72% to ~96%. 
    
    Our encoding detector should now be at least as accurate as that.
    
    Any thoughts on the methodology, as compared to what we had before?
