[
https://issues.apache.org/jira/browse/TIKA-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15390603#comment-15390603
]
Shabanali Faghani commented on TIKA-2038:
-----------------------------------------
Thanks, Tim! I agree with you that icu4j is better than Mozilla's detector in
general, but there are [some
points|http://stackoverflow.com/questions/499010/java-how-to-determine-the-correct-charset-encoding-of-a-stream/37196594#37196594]
about both of them that should be noted.
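For reference, a minimal sketch of how the two detectors are usually driven side
by side (the icu4j part mirrors what Tika already wraps; the juniversalchardet
calls are an assumption about the Mozilla port, not code from this issue):

{code:java}
import com.ibm.icu.text.CharsetDetector;
import com.ibm.icu.text.CharsetMatch;
import org.mozilla.universalchardet.UniversalDetector;

public class SideBySideDetection {
    public static void main(String[] args) throws Exception {
        byte[] raw = java.nio.file.Files.readAllBytes(java.nio.file.Paths.get(args[0]));

        // icu4j: statistical detection over the raw bytes
        CharsetDetector icu = new CharsetDetector();
        icu.setText(raw);
        CharsetMatch best = icu.detect();
        System.out.println("icu4j:   " + best.getName()
                + " (confidence " + best.getConfidence() + ")");

        // Mozilla (juniversalchardet port): streaming state machine
        UniversalDetector mozilla = new UniversalDetector(null);
        mozilla.handleData(raw, 0, raw.length);
        mozilla.dataEnd();
        System.out.println("mozilla: " + mozilla.getDetectedCharset());
        mozilla.reset();
    }
}
{code}
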
Unfortunately, I didn't compare the results of my algorithm against the
charsets declared in meta tags. I think estimating the ratio of HTML documents
whose meta tags declare a wrong charset is almost impossible, because, as I've
explained on page 11 of [my
paper|http://link.springer.com/chapter/10.1007/978-3-319-28940-3_17],
determining whether a given charset correctly decodes a document is very hard
for large collections.
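To illustrate the difficulty: a strict decoder only rejects byte sequences that
are impossible in a given charset, and most single-byte charsets decode almost
any document without error, so "decodes without error" says little about whether
the declared charset is actually correct. A minimal plain-JDK sketch of such a
check (not code from the paper):

{code:java}
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

public class StrictDecodeCheck {
    // Returns true if 'raw' decodes with 'charsetName' without any malformed
    // or unmappable bytes. Note: true does NOT mean the charset is the right
    // one, only that it is not provably wrong for this byte sequence.
    static boolean decodesWithoutError(byte[] raw, String charsetName) {
        try {
            Charset.forName(charsetName)
                   .newDecoder()
                   .onMalformedInput(CodingErrorAction.REPORT)
                   .onUnmappableCharacter(CodingErrorAction.REPORT)
                   .decode(ByteBuffer.wrap(raw));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
{code}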
> A more accurate facility for detecting Charset Encoding of HTML documents
> -------------------------------------------------------------------------
>
> Key: TIKA-2038
> URL: https://issues.apache.org/jira/browse/TIKA-2038
> Project: Tika
> Issue Type: Improvement
> Components: core, detector
> Reporter: Shabanali Faghani
> Priority: Minor
>
> Currently, Tika uses icu4j for detecting the charset encoding of HTML
> documents as well as other text documents. But the accuracy of encoding
> detection tools, including icu4j, on HTML documents is meaningfully lower
> than on other text documents. Hence, in our project I developed a library
> that works pretty well for HTML documents, which is available here:
> https://github.com/shabanali-faghani/IUST-HTMLCharDet
> Since Tika is widely used with and within other Apache projects such as
> Nutch, Lucene, Solr, etc., and these projects deal heavily with HTML
> documents, it seems that having such a facility in Tika would also help
> them become more accurate.
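For context, a minimal sketch of how charset detection is typically invoked
through Tika's EncodingDetector interface (the icu4j-backed detector shown here
is the one Tika currently ships; an HTML-specific detector like the one proposed
above would plug in behind the same interface; class names and setup are
assumptions, not code from this issue):

{code:java}
import java.io.BufferedInputStream;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.tika.detect.EncodingDetector;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.txt.Icu4jEncodingDetector;

public class TikaCharsetDetection {
    public static void main(String[] args) throws Exception {
        // Icu4jEncodingDetector is Tika's icu4j-backed implementation of
        // the EncodingDetector interface (requires tika-parsers on the classpath).
        EncodingDetector detector = new Icu4jEncodingDetector();
        try (InputStream in = new BufferedInputStream(
                Files.newInputStream(Paths.get(args[0])))) {
            Charset charset = detector.detect(in, new Metadata());
            System.out.println(charset);  // e.g. UTF-8, windows-1256, ...
        }
    }
}
{code}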
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)