GitHub user HansBrende commented on the issue:

    https://github.com/apache/any23/pull/131
  
    @lewismc I've added some additional unit tests that cover the main 
issues we've been having with encoding detection.
    
    Unfortunately, the only real way to test this comprehensively is to run it 
against millions of webpages "in the wild". Still, I am confident that this 
change is a huge improvement over what we have *now*, based on our past 
problems with encoding detection and on the discussions over in Tika about the 
various issues *they've* been having with encoding detection.
    
    Compare to the original version of this file 
[here](https://github.com/apache/any23/blob/bd607c1cc8c63225f9678ec967c73daa474b45aa/encoding/src/main/java/org/apache/any23/encoding/TikaEncodingDetector.java).
    
    Since that time, I've made a couple of changes to the algorithm to fix 
problems we've encountered along the way, but those tweaks weren't as 
comprehensive as this one is.
    
    Ideally, I'd like to compare this more comprehensive solution against our 
original solution across millions of webpages, but I'm not yet sure how to 
proceed in that regard.

