[ https://issues.apache.org/jira/browse/TIKA-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15415138#comment-15415138 ]

Tim Allison commented on TIKA-2038:
-----------------------------------

I regret that I won't have a chance to redo your work in the near future.  If 
you'd like to contribute by re-fetching your URLs, that'd be a great help.  
It might be useful to include other languages/countries/encodings, too.

On analysis, I agree that we can use the HTTP header or the HTML meta-header as 
"ground truth" for one type of analysis.  With some caveats, it should be 
somewhat useful information.
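
For concreteness, here is a minimal sketch of pulling the declared charset from 
either source; the class and helper names are just illustrative, and real 
header/meta parsing has more edge cases than this one regex covers:

    import java.nio.charset.StandardCharsets;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class DeclaredCharset {

        private static final Pattern CHARSET =
                Pattern.compile("charset\\s*=\\s*[\"']?([\\w\\-]+)",
                                Pattern.CASE_INSENSITIVE);

        // Charset declared in the HTTP Content-Type header, or null.
        static String fromHeader(String contentType) {
            if (contentType == null) {
                return null;
            }
            Matcher m = CHARSET.matcher(contentType);
            return m.find() ? m.group(1) : null;
        }

        // Charset declared in an HTML <meta> tag, or null.  Decoding the
        // prefix as ISO-8859-1 is safe for sniffing ascii-compatible markup.
        static String fromMeta(byte[] html) {
            int len = Math.min(html.length, 8192);  // sniff the head only
            String prefix = new String(html, 0, len, StandardCharsets.ISO_8859_1);
            Matcher m = CHARSET.matcher(prefix);
            return m.find() ? m.group(1) : null;
        }
    }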

However, I'm concerned about those cases where decoding with several different 
charsets yields the same result.  If "ground truth" says UTF-8 and a 
detector says ISO-8859-1, but the page contains only ASCII, then the detector 
will be "incorrect", yet the results (the extracted text) will be identical.  
The same is true even for, e.g., ISO-8859-1 vs. windows-1256.
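
To make that concrete: a pure-ASCII page decodes to the identical string under 
all three of those charsets, so the "wrong" detector answer costs nothing in 
the extracted text:

    import java.nio.charset.Charset;
    import java.nio.charset.StandardCharsets;

    public class AsciiAmbiguity {
        public static void main(String[] args) {
            byte[] page = "<html><body>Hello, world!</body></html>"
                    .getBytes(StandardCharsets.US_ASCII);

            String utf8    = new String(page, StandardCharsets.UTF_8);
            String latin1  = new String(page, StandardCharsets.ISO_8859_1);
            String win1256 = new String(page, Charset.forName("windows-1256"));

            // All three decodings agree character-for-character on pure-ascii input.
            System.out.println(utf8.equals(latin1) && latin1.equals(win1256)); // true
        }
    }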

So, in addition to the "ground truth" analysis, I propose a second method where 
we run Tika's current algorithm and then the proposed change.  We then 
run the tika-eval code to look for content differences and then randomly 
sample, say, 100 differences; we can manually label each as "better", "worse", 
or "mixed".  With 100 cases, our error bars should be small enough to be 
reasonable. 
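
In code, the sampling step might look something like the sketch below.  I'm 
assuming icu4j's CharsetDetector stands in for Tika's current behavior and 
proposedDetect is a placeholder for the candidate detector; tika-eval's actual 
content comparison is much richer than the raw string equality used here:

    import com.ibm.icu.text.CharsetDetector;
    import java.nio.charset.Charset;
    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;
    import java.util.Random;

    public class SampleDiffs {

        // Current behavior: icu4j's detector, as Tika uses today.
        // Note: detect() can return null on degenerate input; ignored here.
        static String icuDetect(byte[] raw) {
            return new CharsetDetector().setText(raw).detect().getName();
        }

        // Placeholder for the proposed detector (e.g. IUST-HTMLCharDet).
        static String proposedDetect(byte[] raw) {
            throw new UnsupportedOperationException("plug in candidate detector");
        }

        // Keep only pages whose extracted text differs between the two
        // detectors, then randomly sample up to 100 for manual judging.
        static List<byte[]> sampleDifferences(List<byte[]> corpus, long seed) {
            List<byte[]> diffs = new ArrayList<>();
            for (byte[] raw : corpus) {
                String a = new String(raw, Charset.forName(icuDetect(raw)));
                String b = new String(raw, Charset.forName(proposedDetect(raw)));
                if (!a.equals(b)) {
                    diffs.add(raw);
                }
            }
            Collections.shuffle(diffs, new Random(seed));
            return diffs.subList(0, Math.min(100, diffs.size()));
        }
    }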

> A more accurate facility for detecting Charset Encoding of HTML documents
> -------------------------------------------------------------------------
>
>                 Key: TIKA-2038
>                 URL: https://issues.apache.org/jira/browse/TIKA-2038
>             Project: Tika
>          Issue Type: Improvement
>          Components: core, detector
>            Reporter: Shabanali Faghani
>            Priority: Minor
>         Attachments: comparisons_20160803b.xlsx, comparisons_20160804.xlsx, 
> iust_encodings.zip, tika_1_14-SNAPSHOT_encoding_detector.zip
>
>
> Currently, Tika uses icu4j for detecting the charset encoding of HTML documents 
> as well as other plain-text documents. But the accuracy of encoding detection 
> tools, including icu4j, is meaningfully lower for HTML documents than for 
> other text documents. Hence, in our project I developed a library that works 
> pretty well for HTML documents, which is available here: 
> https://github.com/shabanali-faghani/IUST-HTMLCharDet
> Since Tika is widely used with and within other Apache projects such as 
> Nutch, Lucene, Solr, etc., and these projects deal heavily with HTML 
> documents, it seems that having such a facility in Tika would help them 
> become more accurate as well.



