Hi all,

I've developed a Java library for detecting the charset encoding of HTML documents. My tests show that it is considerably more accurate than all the existing tools in this area, including icu4j, jchardet, juniversalchardet, cpdetector, lucene-icu4j and also TikaEncodingDetector.
Below are some links related to my library:

Code on GitHub: https://github.com/shabanali-faghani/IUST-HTMLCharDet
Paper link: http://link.springer.com/chapter/10.1007/978-3-319-28940-3_17
Maven Central: http://mvnrepository.org/artifact/ir.ac.iust/htmlchardet/1.0.0

What do you think about adding this tool to Tika's detect package as another class, say HTMLEncodingDetector, implementing the EncodingDetector interface [1]? Or it might be an even better idea to have a separate module, say tika-encodingdetect, with its own POM, and put HTMLEncodingDetector and the other related classes in it, just like the tika-langdetect module [2].

Hope this helps Tika!

-------------
From Chris Mattmann in private contact:
> Thanks, sure please open up a PR http://github.com/apache/tika/#contributing-via-github
> and a discussion on [email protected] and would be happy to proceed.

@Chris To open a PR, I've also created a JIRA issue with id TIKA-2038 [3].

Thanks,
Shabanali

[1] http://grepcode.com/file/repo1.maven.org/maven2/org.apache.tika/tika-core/1.9/org/apache/tika/detect/EncodingDetector.java?av=f
    OR https://github.com/apache/tika/blob/master/tika-core/src/main/java/org/apache/tika/detect/EncodingDetector.java
[2] https://github.com/apache/tika/tree/master/tika-langdetect
[3] https://issues.apache.org/jira/browse/TIKA-2038
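
P.S. To make the proposal a bit more concrete, here is a rough sketch of the shape an HTMLEncodingDetector class could take. Everything in it is an assumption for illustration only: SimpleEncodingDetector is a local stand-in so the snippet compiles without Tika (the real EncodingDetector.detect also takes a Metadata argument), and the meta-tag regex is just a placeholder, not the detection algorithm used by IUST-HTMLCharDet. It does follow the usual convention of such detectors: peek at the head of a mark-supporting stream, reset it, and return null when nothing can be determined.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HtmlEncodingDetectorSketch {

    /** Local stand-in (assumption); Tika's real interface also takes a Metadata argument. */
    interface SimpleEncodingDetector {
        Charset detect(InputStream input) throws IOException;
    }

    /** Hypothetical detector; the regex is a placeholder for the real detection logic. */
    static class HTMLEncodingDetector implements SimpleEncodingDetector {
        private static final int PEEK_BYTES = 8192;
        private static final Pattern META_CHARSET = Pattern.compile(
                "<meta[^>]*charset\\s*=\\s*[\"']?([A-Za-z0-9][A-Za-z0-9_.:+-]*)",
                Pattern.CASE_INSENSITIVE);

        @Override
        public Charset detect(InputStream input) throws IOException {
            if (input == null || !input.markSupported()) {
                return null; // cannot peek without consuming the stream
            }
            input.mark(PEEK_BYTES);
            byte[] buf = new byte[PEEK_BYTES];
            int total = 0;
            try {
                int n;
                // Read up to PEEK_BYTES from the head of the document.
                while (total < PEEK_BYTES
                        && (n = input.read(buf, total, PEEK_BYTES - total)) != -1) {
                    total += n;
                }
            } finally {
                input.reset(); // leave the stream untouched for the caller
            }
            // ISO-8859-1 maps every byte to one char, so ASCII markup survives decoding.
            String head = new String(buf, 0, total, StandardCharsets.ISO_8859_1);
            Matcher m = META_CHARSET.matcher(head);
            if (!m.find()) {
                return null; // no declaration found
            }
            try {
                return Charset.forName(m.group(1));
            } catch (IllegalArgumentException e) {
                return null; // declared charset is unknown or malformed
            }
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] html = "<html><head><meta charset=\"UTF-8\"></head><body>hi</body></html>"
                .getBytes(StandardCharsets.US_ASCII);
        Charset detected = new HTMLEncodingDetector().detect(new ByteArrayInputStream(html));
        System.out.println("Detected charset: " + detected);
    }
}
```

In a real patch the class would implement org.apache.tika.detect.EncodingDetector directly and delegate to the library instead of the regex; whether it lives in tika-core's detect package or in a separate module does not change this shape.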
