[
https://issues.apache.org/jira/browse/TIKA-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15389493#comment-15389493
]
Tim Allison commented on TIKA-2038:
-----------------------------------
For integration into Tika, I see two options:
1) we could add your library as a dependency once it's published to Maven
Central. Downside: we'd then be depending on the full icu4j (not a small
addition). Upside: it would make for cleaner code on our part, and you'd get
better credit.
2) you could work with us on a PR to add your logic and some of your code into
Tika. This would let us keep our current practice of copying a few classes
from icu4j rather than pulling in the entire library.
[~faghani] and fellow devs, preferences? My preference would be 2).
[~faghani], do you mind if we add your HTML test set to our regular testing
corpus?
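
For anyone following along: the hybrid idea behind HTML-aware detectors is,
roughly, to consult the charset declared in the markup before (or alongside)
byte-level statistical detection. A minimal stdlib-only sketch of the
markup-sniffing half (the class name and regex below are my own illustration,
not code from IUST-HTMLCharDet or icu4j):

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MetaCharsetSniffer {
    // Matches both <meta charset="..."> and
    // <meta http-equiv="Content-Type" content="text/html; charset=...">
    private static final Pattern META_CHARSET = Pattern.compile(
            "charset\\s*=\\s*[\"']?([\\w-]+)", Pattern.CASE_INSENSITIVE);

    /** Returns the charset declared in the markup, or null if none found. */
    public static String sniff(byte[] html) {
        // Decode only a prefix as ISO-8859-1: it maps every byte 1:1, so the
        // ASCII markup survives no matter what the true encoding is.
        int len = Math.min(html.length, 4096);
        String head = new String(html, 0, len, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(head);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        byte[] doc = ("<html><head><meta charset=\"windows-1256\"></head>"
                + "<body>...</body></html>")
                .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(MetaCharsetSniffer.sniff(doc)); // prints windows-1256
    }
}
```

A real detector would of course still fall back to byte-level statistics when
the declaration is absent or wrong, which is where icu4j comes in.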
> A more accurate facility for detecting Charset Encoding of HTML documents
> -------------------------------------------------------------------------
>
> Key: TIKA-2038
> URL: https://issues.apache.org/jira/browse/TIKA-2038
> Project: Tika
> Issue Type: Improvement
> Components: core, detector
> Reporter: Shabanali Faghani
> Priority: Minor
>
> Currently, Tika uses icu4j for detecting the charset encoding of HTML
> documents as well as other text documents. However, the accuracy of encoding
> detection tools, including icu4j, is meaningfully lower for HTML documents
> than for other text documents. Hence, in our project I developed a library
> that works quite well for HTML documents, which is available here:
> https://github.com/shabanali-faghani/IUST-HTMLCharDet
> Since Tika is widely used with and within several other Apache projects such
> as Nutch, Lucene, Solr, etc., and these projects deal heavily with HTML
> documents, having such a facility in Tika would help them become more
> accurate as well.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)