[
https://issues.apache.org/jira/browse/TIKA-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15849891#comment-15849891
]
Tim Allison commented on TIKA-2038:
-----------------------------------
bq. The overall accuracy of Tika, i.e. 72%, is less than the accuracy of
JUniversalCharDet, which is 74%!! It’s an odd phenomenon because
JUniversalCharDet is a sub-component of Tika. I think this is due to the way
you use JUniversalCharDet in Tika; that is, a kind of early termination in data
feeding … listener.handleData(b, 0, m);
In contrast, in this comparison I used a feed-all approach as follows …
detector.handleData(rawHtmlByteSequence, 0, rawHtmlByteSequence.length);
Y, this makes sense. The other potential cause is that if an HTML page
misidentifies its encoding via a meta header, Tika will rely on that
declaration without running the other detectors.
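For concreteness, here is a minimal sketch of the feed-all approach described
above, calling juniversalchardet's UniversalDetector directly (the wrapper
method name is just for illustration; handleData/dataEnd/getDetectedCharset are
the standard juniversalchardet API):
{code:java}
import org.mozilla.universalchardet.UniversalDetector;

// Feed the entire raw byte sequence to the detector in one call, then signal
// end of data, rather than terminating early once enough bytes have been seen.
public static String detectFeedAll(byte[] rawHtmlByteSequence) {
    UniversalDetector detector = new UniversalDetector(null);
    detector.handleData(rawHtmlByteSequence, 0, rawHtmlByteSequence.length);
    detector.dataEnd();
    String charset = detector.getDetectedCharset(); // null if nothing detected
    detector.reset();
    return charset;
}
{code}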
bq. The ASF's Jira doesn't allow uploading files larger than 19.54 MB.
Right. On further thought, I would like to build a smallish corpus from Common
Crawl for this purpose. If we did random sampling by URL country code (.iq,
.kr, etc.) for the countries you've identified, would that meet our needs?
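A rough sketch of what that per-country sampling might look like, assuming we
already have a flat file of candidate URLs extracted from a Common Crawl index
(class and method names here are purely hypothetical):
{code:java}
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.*;

public class TldSampler {
    // Keep only URLs whose host ends with one of the target country-code TLDs,
    // then take a fixed-size random sample per TLD.
    public static Map<String, List<String>> sampleByTld(Path urlList,
                                                        Set<String> tlds,
                                                        int perTld) throws Exception {
        Map<String, List<String>> byTld = new HashMap<>();
        for (String line : Files.readAllLines(urlList)) {
            String url = line.trim();
            String host = URI.create(url).getHost();
            if (host == null || host.lastIndexOf('.') < 0) continue;
            String tld = host.substring(host.lastIndexOf('.')); // e.g. ".kr"
            if (tlds.contains(tld)) {
                byTld.computeIfAbsent(tld, k -> new ArrayList<>()).add(url);
            }
        }
        Map<String, List<String>> sample = new HashMap<>();
        Random rnd = new Random(42); // fixed seed for a reproducible sample
        for (Map.Entry<String, List<String>> e : byTld.entrySet()) {
            Collections.shuffle(e.getValue(), rnd);
            int n = Math.min(perTld, e.getValue().size());
            sample.put(e.getKey(), new ArrayList<>(e.getValue().subList(0, n)));
        }
        return sample;
    }
}
{code}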
> A more accurate facility for detecting Charset Encoding of HTML documents
> -------------------------------------------------------------------------
>
> Key: TIKA-2038
> URL: https://issues.apache.org/jira/browse/TIKA-2038
> Project: Tika
> Issue Type: Improvement
> Components: core, detector
> Reporter: Shabanali Faghani
> Priority: Minor
> Attachments: comparisons_20160803b.xlsx, comparisons_20160804.xlsx,
> iust_encodings.zip, lang-wise-eval_results.zip, lang-wise-eval_runnable.zip,
> lang-wise-eval_source_code.zip, tika_1_14-SNAPSHOT_encoding_detector.zip
>
>
> Currently, Tika uses icu4j for detecting the charset encoding of HTML
> documents as well as other plain-text documents. But the accuracy of encoding
> detector tools, including icu4j, on HTML documents is meaningfully lower than
> on other text documents. Hence, in our project I developed a library that
> works pretty well for HTML documents, which is available here:
> https://github.com/shabanali-faghani/IUST-HTMLCharDet
> Since Tika is widely used with and within other Apache projects such as
> Nutch, Lucene, Solr, etc., and these projects deal heavily with HTML
> documents, it seems that having such a facility in Tika would also help them
> become more accurate.