[ 
https://issues.apache.org/jira/browse/TIKA-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15401870#comment-15401870
 ] 

Tim Allison edited comment on TIKA-2038 at 8/1/16 11:17 AM:
------------------------------------------------------------

bq. Then I remembered that almost all of the test files in my corpus have 
charset information in their Meta tags
To clarify, you're saying that almost all of the test files in the first corpus 
have charset information.  However, to confirm, in the second corpus (language 
dependent), that number drops to 50%, right?


was (Author: [email protected]):
>Then I remembered that almost all of the test files in my corpus have charset 
>information in their Meta tags
To clarify, in the first corpus.  However, to confirm, in the second corpus, 
that number drops to 50%, right?

> A more accurate facility for detecting Charset Encoding of HTML documents
> -------------------------------------------------------------------------
>
>                 Key: TIKA-2038
>                 URL: https://issues.apache.org/jira/browse/TIKA-2038
>             Project: Tika
>          Issue Type: Improvement
>          Components: core, detector
>            Reporter: Shabanali Faghani
>            Priority: Minor
>         Attachments: tika_1_14-SNAPSHOT_encoding_detector.zip
>
>
> Currently, Tika uses icu4j to detect the charset encoding of HTML documents 
> as well as other natively textual documents. However, the accuracy of 
> encoding detection tools, including icu4j, is meaningfully lower for HTML 
> documents than for other text documents. Hence, in our project I developed 
> a library that works pretty well for HTML documents, which is available 
> here: https://github.com/shabanali-faghani/IUST-HTMLCharDet
> Since Tika is widely used with and within other Apache projects such as 
> Nutch, Lucene, and Solr, and these projects deal heavily with HTML 
> documents, it seems that having such a facility in Tika would also help 
> them become more accurate.
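
For readers following the discussion of charset information in Meta tags, 
here is a minimal sketch of how such a declaration could be read from an 
HTML file. It is an illustration only: it is neither Tika's detector nor the 
IUST-HTMLCharDet implementation. The class name MetaCharsetSniffer, the 
method declaredCharset, and the 8192-byte prefix limit are assumptions made 
for this example, and it uses only the Java standard library.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Tika or IUST-HTMLCharDet.
final class MetaCharsetSniffer {

    // Matches charset=NAME inside a <meta ...> tag. This covers both
    // <meta charset="utf-8"> and
    // <meta http-equiv="Content-Type" content="text/html; charset=utf-8">.
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta\\s[^>]*?charset\\s*=\\s*[\"']?\\s*([A-Za-z0-9._:-]+)",
            Pattern.CASE_INSENSITIVE);

    // Assumed limit: only the start of the document is scanned.
    private static final int PREFIX_BYTES = 8192;

    // Returns the charset declared in a meta tag, or empty if none is
    // declared or the declared name is not a charset this JVM supports.
    static Optional<Charset> declaredCharset(byte[] html) {
        int len = Math.min(html.length, PREFIX_BYTES);
        // ISO-8859-1 maps every byte to one char, so the ASCII markup
        // can be searched before the real encoding is known.
        String head = new String(html, 0, len, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(head);
        if (!m.find()) {
            return Optional.empty();
        }
        try {
            return Optional.of(Charset.forName(m.group(1)));
        } catch (IllegalArgumentException e) {
            // Illegal or unsupported charset name in the declaration.
            return Optional.empty();
        }
    }
}
```

Counting how often declaredCharset returns a non-empty value over a corpus 
would give the kind of percentage discussed in the comment above. A declared 
charset can still be wrong, which is why statistical detection remains 
relevant.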



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
