A more accurate facility for detecting Charset Encoding of HTML documents

Shabanali Faghani Thu, 21 Jul 2016 15:32:30 -0700

Hi all,

I've developed a Java library for detecting the charset encoding of HTML
documents. My tests show that it is much more accurate than all the
existing tools in this area, including icu4j, jchardet, juniversalchardet,
cpdetector, lucene-icu4j, and also TikaEncodingDetector.


Below, I've provided some links related to my library:
Code on github: https://github.com/shabanali-faghani/IUST-HTMLCharDet
Paper link: http://link.springer.com/chapter/10.1007/978-3-319-28940-3_17
Maven Central: http://mvnrepository.org/artifact/ir.ac.iust/htmlchardet/1.0.0

Please let me know what you think about adding this tool to Tika's detect
package as another class, say HTMLEncodingDetector, implementing the
EncodingDetector [1] interface. It might be an even better idea to create
a separate module, say tika-encodingdetect, with its own POM, and put
HTMLEncodingDetector and the other related classes in it, just like the
tika-langdetect module [2].
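
As a rough illustration of the first option, here is a minimal sketch of
what such a class might look like. It is not the library's actual code.
The local EncodingDetector interface is only a stand-in for
org.apache.tika.detect.EncodingDetector (the real one also takes a
Metadata argument), so that the example compiles without Tika on the
classpath. The detection step is a placeholder that reads a <meta>
charset declaration, not the library's own detection logic. The names
MARK_LIMIT and META_CHARSET are made up for the example.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Stand-in for org.apache.tika.detect.EncodingDetector (assumption: simplified;
// the real Tika interface is detect(InputStream, Metadata)).
interface EncodingDetector {
    Charset detect(InputStream input) throws IOException;
}

// Hypothetical detector class, named as proposed in this message.
class HTMLEncodingDetector implements EncodingDetector {
    // How many leading bytes to inspect (illustrative value).
    private static final int MARK_LIMIT = 8192;

    // Placeholder heuristic: find charset=... inside a <meta> tag.
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]*charset\\s*=\\s*[\"']?([A-Za-z0-9_.:-]+)",
            Pattern.CASE_INSENSITIVE);

    @Override
    public Charset detect(InputStream input) throws IOException {
        if (input == null || !input.markSupported()) {
            return null; // cannot peek without consuming the stream
        }
        input.mark(MARK_LIMIT);
        try {
            byte[] buf = new byte[MARK_LIMIT];
            int n = 0;
            while (n < buf.length) {
                int r = input.read(buf, n, buf.length - n);
                if (r < 0) {
                    break; // end of stream
                }
                n += r;
            }
            // ISO-8859-1 maps every byte to one char, so ASCII markup survives.
            String head = new String(buf, 0, n, StandardCharsets.ISO_8859_1);
            Matcher m = META_CHARSET.matcher(head);
            if (m.find()) {
                try {
                    return Charset.forName(m.group(1));
                } catch (IllegalArgumentException e) {
                    return null; // illegal or unsupported charset name
                }
            }
            return null; // no declaration found: no opinion
        } finally {
            input.reset(); // leave the stream where the caller had it
        }
    }
}
```

Returning null when nothing is found, and resetting the stream before
returning, is meant to let other detectors run on the same input
afterwards. A real implementation would call the library's detection
code where the regular expression is used here.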

Hope that helps Tika!
-------------

From Chris Mattmann in private contact:
> Thanks, sure please open up a PR
> http://github.com/apache/tika/#contributing-via-github
> and a discussion on [email protected] and would be happy to proceed.

@Chris
To open up a PR, I've also created a JIRA issue with the ID TIKA-2038 [3].

Thanks,
Shabanali

[1] http://grepcode.com/file/repo1.maven.org/maven2/org.apache.tika/tika-core/1.9/org/apache/tika/detect/EncodingDetector.java?av=f
    or
    https://github.com/apache/tika/blob/master/tika-core/src/main/java/org/apache/tika/detect/EncodingDetector.java
[2] https://github.com/apache/tika/tree/master/tika-langdetect
[3] https://issues.apache.org/jira/browse/TIKA-2038
