[
https://issues.apache.org/jira/browse/TIKA-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15406408#comment-15406408
]
Tim Allison edited comment on TIKA-2038 at 8/3/16 6:51 PM:
-----------------------------------------------------------
I wrote a markup stripper that ignores content in tags, comments, <style> and
<script> elements. I then compared:
# Tika's default detection algorithm
# The proposed detection algorithm
# HTMLEncodingDetector
# UniversalEncodingDetector
# UniversalEncodingDetector (on input that had been stripped)
# ICU4J
# ICU4J (on input that had been stripped)
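The stripping step described above can be sketched roughly as follows. This is a minimal illustration only, not the detector code attached to this issue; the class and method names are hypothetical, and a regex-based approach is a simplification of what a real markup stripper would do:

```java
import java.util.regex.Pattern;

/**
 * Minimal sketch of a markup stripper: drops comments, the contents
 * of <style> and <script> elements, and any remaining tags, leaving
 * only text content for charset detection.
 */
public class MarkupStripper {

    // Remove comments and script/style blocks before generic tags,
    // so their inner content is discarded rather than exposed.
    private static final Pattern COMMENT =
            Pattern.compile("<!--.*?-->", Pattern.DOTALL);
    private static final Pattern SCRIPT_STYLE =
            Pattern.compile("<(script|style)\\b[^>]*>.*?</\\1\\s*>",
                    Pattern.DOTALL | Pattern.CASE_INSENSITIVE);
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    public static String strip(String html) {
        String s = COMMENT.matcher(html).replaceAll(" ");
        s = SCRIPT_STYLE.matcher(s).replaceAll(" ");
        return TAG.matcher(s).replaceAll(" ");
    }
}
```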
After we do some more evaluation, I propose that we move to this order:
# HTMLEncodingDetector
# ICU4J with added stripping
The performance of ICU4J improves dramatically if we strip the style/script
info, and this is in line with [~faghani] et al.'s findings.
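For the ICU4J-on-stripped-input step, a minimal sketch of what that combination looks like, calling ICU4J's CharsetDetector directly on pre-stripped bytes (the class name here is hypothetical; this is not the code from the attached patch):

```java
import com.ibm.icu.text.CharsetDetector;
import com.ibm.icu.text.CharsetMatch;

/**
 * Sketch: run ICU4J charset detection on HTML bytes that have
 * already had tags, comments, and style/script content stripped.
 */
public class StrippedDetection {

    public static String detect(byte[] strippedHtml) {
        CharsetDetector detector = new CharsetDetector();
        detector.setText(strippedHtml);
        CharsetMatch match = detector.detect();
        // detect() returns the highest-confidence match, or null
        return match == null ? null : match.getName();
    }
}
```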
Let me know what you think...
> A more accurate facility for detecting Charset Encoding of HTML documents
> -------------------------------------------------------------------------
>
> Key: TIKA-2038
> URL: https://issues.apache.org/jira/browse/TIKA-2038
> Project: Tika
> Issue Type: Improvement
> Components: core, detector
> Reporter: Shabanali Faghani
> Priority: Minor
> Attachments: comparisons_20160803.xlsx, iust_encodings.zip,
> tika_1_14-SNAPSHOT_encoding_detector.zip
>
>
> Currently, Tika uses icu4j for detecting the charset encoding of HTML
> documents as well as other plain-text documents. But the accuracy of
> encoding detection tools, including icu4j, on HTML documents is
> meaningfully lower than on other text documents. Hence, in our project
> I developed a library that works pretty well for HTML documents, which
> is available here: https://github.com/shabanali-faghani/IUST-HTMLCharDet
> Since Tika is widely used with and within other Apache projects such as
> Nutch, Lucene, Solr, etc., and these projects deal heavily with HTML
> documents, having such a facility in Tika would help them become more
> accurate as well.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)