[ 
https://issues.apache.org/jira/browse/TIKA-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15885392#comment-15885392
 ] 

Shabanali Faghani commented on TIKA-2038:
-----------------------------------------

Some scattered thoughts:
* Unlike the comparison in the file {{charset_comparisons_training.zip}}, I 
don’t think it’s possible to run Tika's _HtmlEncodingDetector_ 
class/algorithm on a scraped/stripped HTML document, because the meta 
headers are removed when an HTML document is stripped (see the first sketch 
after this list)!
* I couldn’t quite understand what you mean by _training_ in _”I split the 
pull into 80%/20% for training/testing”_.
* I also can’t see why you didn’t include Tika and IUST in your comparison!
* It might also be nice to run an efficiency/speed test for the algorithms.
* _"~1.3 million allegedly htmls totalling ~125GB"_ means that the average size 
of html documents is ~100KB (with or without medias?). That’s while for years I 
thought the average size is 20KB (due to a misunderstanding from the book 
[_Search Engines: Information Retrieval in Practice_| 
https://ciir.cs.umass.edu/irbook/]).
* If the HTTP header is available for all documents, I suggest a small side 
study of how many HTML documents declare a charset in the HTTP header and/or 
in the meta header. It could be done by querying the result tuples, provided 
the HTTP and meta charsets are recorded for each document (see the second 
sketch after this list).
* Just as a side note about TIKA-2273, I think encoding detection in Tika is 
opaque, and I’d be surprised if many users/devs know what the Tika encoding 
detection module is or where it lives. As evidence (!) I’ll honestly admit 
that before your first post in this thread I thought you simply used icu4j; 
though maybe that was still true back when I dug into Tika’s source code in 
August 2012.
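
To make the first point concrete, here is a minimal sketch using Tika's 
{{HtmlEncodingDetector}} directly (the charset and the sample content are just 
placeholders I made up): the detector looks for the meta declaration in the raw 
markup, so once the markup has been stripped there is nothing left for it to 
find.

{code:java}
import java.io.ByteArrayInputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.html.HtmlEncodingDetector;

public class StrippedHtmlDemo {
    public static void main(String[] args) throws Exception {
        HtmlEncodingDetector detector = new HtmlEncodingDetector();

        // The raw document still carries its meta header, so the detector can find it.
        byte[] raw = ("<html><head><meta http-equiv=\"Content-Type\" "
                + "content=\"text/html; charset=windows-1256\"></head>"
                + "<body>some text</body></html>").getBytes(StandardCharsets.ISO_8859_1);
        Charset fromRaw = detector.detect(new ByteArrayInputStream(raw), new Metadata());
        System.out.println("raw html -> " + fromRaw);        // windows-1256

        // After stripping, only the visible text remains; there is no meta header
        // left to parse, so the detector returns null.
        byte[] stripped = "some text".getBytes(StandardCharsets.ISO_8859_1);
        Charset fromStripped = detector.detect(new ByteArrayInputStream(stripped), new Metadata());
        System.out.println("stripped -> " + fromStripped);   // null
    }
}
{code}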
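
And a rough sketch of the kind of tally I have in mind for that side study; the 
{{Doc}} tuple and the placeholder data are hypothetical, only the counting over 
the four charset-source combinations matters.

{code:java}
import java.util.Arrays;
import java.util.List;

public class CharsetSourceTally {

    // Hypothetical result tuple: whether a charset was declared in the HTTP
    // header and/or in a meta tag for a given document.
    static class Doc {
        final boolean httpCharset;
        final boolean metaCharset;
        Doc(boolean httpCharset, boolean metaCharset) {
            this.httpCharset = httpCharset;
            this.metaCharset = metaCharset;
        }
    }

    public static void main(String[] args) {
        // Placeholder data; in practice these flags come from the result tuples.
        List<Doc> docs = Arrays.asList(
                new Doc(true, true), new Doc(true, false),
                new Doc(false, true), new Doc(false, false));

        long both     = docs.stream().filter(d -> d.httpCharset && d.metaCharset).count();
        long httpOnly = docs.stream().filter(d -> d.httpCharset && !d.metaCharset).count();
        long metaOnly = docs.stream().filter(d -> !d.httpCharset && d.metaCharset).count();
        long neither  = docs.stream().filter(d -> !d.httpCharset && !d.metaCharset).count();

        System.out.printf("both=%d, httpOnly=%d, metaOnly=%d, neither=%d%n",
                both, httpOnly, metaOnly, neither);
    }
}
{code}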

> A more accurate facility for detecting Charset Encoding of HTML documents
> -------------------------------------------------------------------------
>
>                 Key: TIKA-2038
>                 URL: https://issues.apache.org/jira/browse/TIKA-2038
>             Project: Tika
>          Issue Type: Improvement
>          Components: core, detector
>            Reporter: Shabanali Faghani
>            Priority: Minor
>         Attachments: comparisons_20160803b.xlsx, comparisons_20160804.xlsx, 
> iust_encodings.zip, lang-wise-eval_results.zip, lang-wise-eval_runnable.zip, 
> lang-wise-eval_source_code.zip, proposedTLDSampling.csv, 
> tika_1_14-SNAPSHOT_encoding_detector.zip, tld_text_html_plus_H_column.xlsx, 
> tld_text_html.xlsx
>
>
> Currently, Tika uses icu4j for detecting the charset encoding of HTML documents 
> as well as other plain-text documents. But the accuracy of encoding detector 
> tools, including icu4j, on HTML documents is noticeably lower than on other 
> text documents. Hence, in our project I developed a library that works pretty 
> well for HTML documents, which is available here: https://github.com/shabanali-faghani/IUST-HTMLCharDet
> Since Tika is widely used with and within other Apache projects such as Nutch, 
> Lucene, Solr, etc., and these projects deal heavily with HTML documents, it 
> seems that having such a facility in Tika would also help them become more 
> accurate.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
