Shabanali Faghani created TIKA-2038:
---------------------------------------

             Summary: A more accurate facility for detecting Charset Encoding of HTML documents
                 Key: TIKA-2038
                 URL: https://issues.apache.org/jira/browse/TIKA-2038
             Project: Tika
          Issue Type: New Feature
          Components: core, detector
            Reporter: Shabanali Faghani
            Priority: Minor


Currently, Tika uses icu4j to detect the charset encoding of HTML documents as 
well as of other plain-text documents. However, encoding detection tools, 
including icu4j, are noticeably less accurate on HTML documents than on other 
text documents. For this reason, in our project I developed a library that 
works well for HTML documents. It is available here: 
https://github.com/shabanali-faghani/IUST-HTMLCharDet
Since Tika is widely used with and within other Apache projects such as 
Nutch, Lucene, and Solr, and these projects deal heavily with HTML documents, 
having such a facility in Tika should also help them become more accurate.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
