Shabanali Faghani created TIKA-2038:
---------------------------------------
Summary: A more accurate facility for detecting Charset Encoding
of HTML documents
Key: TIKA-2038
URL: https://issues.apache.org/jira/browse/TIKA-2038
Project: Tika
Issue Type: New Feature
Components: core, detector
Reporter: Shabanali Faghani
Priority: Minor
Currently, Tika uses icu4j to detect the charset encoding of HTML documents as
well as other natural-text documents. However, the accuracy of encoding
detection tools, including icu4j, is significantly lower for HTML documents
than for other text documents. Hence, in our project I developed a library
that works pretty well for HTML documents. It is available here:
https://github.com/shabanali-faghani/IUST-HTMLCharDet
Since Tika is widely used with and within other Apache projects such as
Nutch, Lucene, and Solr, and these projects deal heavily with HTML documents,
having such a facility in Tika should also help them become more accurate.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)