[
https://issues.apache.org/jira/browse/TIKA-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15389466#comment-15389466
]
Tim Allison edited comment on TIKA-2038 at 7/29/16 7:06 PM:
------------------------------------------------------------
This is great! -I've been wanting to add stripping of HTML markup because I
also found that it confuses ICU4J.- [EDIT: this is wrong; ICU4J already tries
to do this for content between <...>]
See a comparison on our regression corpus
[here|http://162.242.228.174/encoding_detection/]. ICU4J generally does better
than Mozilla's detector, but we were getting quite a few incorrect Big5 results
from ICU4J in cases where Mozilla had windows-1252/ISO-8859-1.
Our current algorithm is to run the following detectors in order; the first one
to return a non-null answer is the encoding we choose:
{noformat}
org.apache.tika.parser.html.HtmlEncodingDetector
org.apache.tika.parser.txt.UniversalEncodingDetector
org.apache.tika.parser.txt.Icu4jEncodingDetector
{noformat}
It looks like you maintain this order: check for the charset metaheader first,
then run detection only if necessary.
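For concreteness, here is a minimal sketch of that first-non-null cascade, assuming Tika's {{EncodingDetector}} interface ({{Charset detect(InputStream, Metadata)}}); the {{FirstNonNullEncodingDetector}} harness class below is made up for illustration and is not Tika's actual chaining code:
{code:java}
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.util.Arrays;
import java.util.List;

import org.apache.tika.detect.EncodingDetector;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.html.HtmlEncodingDetector;
import org.apache.tika.parser.txt.Icu4jEncodingDetector;
import org.apache.tika.parser.txt.UniversalEncodingDetector;

public class FirstNonNullEncodingDetector {

    // Same order as above: meta header check first, then statistical detection.
    private static final List<EncodingDetector> DETECTORS = Arrays.asList(
            new HtmlEncodingDetector(),
            new UniversalEncodingDetector(),
            new Icu4jEncodingDetector());

    /**
     * Returns the first non-null Charset reported by the chain, or null if
     * no detector is confident enough. The stream must support mark/reset
     * because each detector reads ahead and then resets it.
     */
    public static Charset detect(InputStream stream, Metadata metadata)
            throws IOException {
        InputStream buffered = stream.markSupported()
                ? stream : new BufferedInputStream(stream);
        for (EncodingDetector detector : DETECTORS) {
            Charset charset = detector.detect(buffered, metadata);
            if (charset != null) {
                return charset;
            }
        }
        return null;
    }
}
{code}
The null return mirrors the "first non-null answer wins" rule above; a caller could fall back to a default such as UTF-8 when the whole chain comes up empty.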
Out of curiosity, did you compare the results of your algorithm against the
metaheader info? Do you have an estimate of how often that info is wrong?
> A more accurate facility for detecting Charset Encoding of HTML documents
> -------------------------------------------------------------------------
>
> Key: TIKA-2038
> URL: https://issues.apache.org/jira/browse/TIKA-2038
> Project: Tika
> Issue Type: Improvement
> Components: core, detector
> Reporter: Shabanali Faghani
> Priority: Minor
>
> Currently, Tika uses ICU4J for detecting the charset encoding of HTML documents
> as well as of other plain-text documents. But the accuracy of encoding
> detection tools, including ICU4J, on HTML documents is meaningfully lower
> than on other text documents. Hence, in our project I developed a library
> that works pretty well for HTML documents,
> which is available here: https://github.com/shabanali-faghani/IUST-HTMLCharDet
> Since Tika is widely used with and within other Apache projects such as
> Nutch, Lucene, Solr, etc., and those projects deal heavily with
> HTML documents, it seems that having such a facility in Tika would also
> help them become more accurate.