[jira] [Comment Edited] (TIKA-2771) enableInputFilter() wrecks charset detection for some short html documents

Hans Brende (JIRA) Thu, 01 Nov 2018 13:13:16 -0700


    [ 
https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16672116#comment-16672116
 ]


Hans Brende edited comment on TIKA-2771 at 11/1/18 8:12 PM:
------------------------------------------------------------

Not sure if this is a contributing factor, but peering into the source code 
reveals that the IBM500 detector is based on ngram detection with a space 
character of 0x20. But the space character for IBM500 is actually 0x40. 
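
The space-byte claim can be checked directly against the JDK's own charset tables. The following is a minimal sketch, not Tika or ICU code: the class name `SpaceByteDemo` and the helper `spaceByte` are made up for illustration, and it assumes the running JDK ships the IBM500 charset (it lives in the extended charset set on standard builds).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class SpaceByteDemo {

    /** Returns the unsigned byte value that encodes U+0020 SPACE in the given charset. */
    static int spaceByte(Charset cs) {
        byte[] encoded = " ".getBytes(cs);
        return encoded[0] & 0xFF;
    }

    public static void main(String[] args) {
        // Assumption: IBM500 is available in this JDK; forName throws otherwise.
        Charset ibm500 = Charset.forName("IBM500");

        System.out.printf("IBM500 space: 0x%02X%n", spaceByte(ibm500));
        System.out.printf("ASCII  space: 0x%02X%n", spaceByte(StandardCharsets.US_ASCII));

        // Decode an ASCII space byte as IBM500 to see what an ngram table
        // built around 0x20 would actually be treating as a "space".
        String decoded = new String(new byte[] {0x20}, ibm500);
        System.out.printf("0x20 decoded as IBM500: U+%04X%n", (int) decoded.charAt(0));
    }
}
```

If the ngram tables are keyed on 0x20 as the word separator, word boundaries in genuine IBM500 text (which use 0x40) would not be recognized as such.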

Also, it appears that the confidence for IBM500 is obtained by multiplying the 
raw fraction of ngram hits by 300 (i.e. 300%). Is that number arbitrary? 
Shouldn't the "confidence" drop sharply when the input is very short, and the 
hit rate is therefore not very statistically significant?
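
To make the second question concrete, here is one possible way a length penalty could look. This is only a sketch under stated assumptions: `DampedConfidence`, `HALF_WEIGHT_NGRAMS`, the cap at 100, and the weighting formula are all hypothetical and are not taken from the Tika or ICU source. Only the "hit fraction times 300" step comes from the description above.

```java
public class DampedConfidence {

    // Hypothetical tuning constant: the ngram count at which the
    // length weight reaches one half.
    static final int HALF_WEIGHT_NGRAMS = 50;

    /**
     * Confidence as described in the comment: the fraction of ngram hits
     * multiplied by 300. The cap at 100 is an assumption of this sketch.
     */
    static int rawConfidence(int hits, int ngrams) {
        if (ngrams == 0) {
            return 0;
        }
        double fraction = (double) hits / ngrams;
        return (int) Math.min(100.0, fraction * 300.0);
    }

    /**
     * Scales the raw confidence by a weight in [0, 1) that grows with the
     * number of ngrams examined, so short inputs are penalized heavily.
     */
    static int dampedConfidence(int hits, int ngrams) {
        double weight = (double) ngrams / (ngrams + HALF_WEIGHT_NGRAMS);
        return (int) Math.round(rawConfidence(hits, ngrams) * weight);
    }

    public static void main(String[] args) {
        // Same hit rate (20%), very different sample sizes.
        System.out.println(dampedConfidence(2, 10));
        System.out.println(dampedConfidence(200, 1000));
    }
}
```

With these assumed constants, 2 hits out of 10 ngrams falls from a raw confidence of 60 to 10, while 200 hits out of 1000 only falls to 57, so a tiny sample can no longer outrank a well-supported match.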


was (Author: hansbrende):
Not sure if this is a contributing factor, but peering into the source code 
reveals that IBM500 is based on ngrams with a space character of 0x20. But the 
space character for IBM500 is actually 0x40. 

Also, it appears that the confidence for IBM500 is obtained by multiplying the 
raw fractional percentage of ngram hits by 300%. Is that number arbitrary? 
Shouldn't the "confidence" decrease by a lot if the length of the input is very 
small, and therefore not very statistically significant?

> enableInputFilter() wrecks charset detection for some short html documents
> --------------------------------------------------------------------------
>
>                 Key: TIKA-2771
>                 URL: https://issues.apache.org/jira/browse/TIKA-2771
>             Project: Tika
>          Issue Type: Bug
>          Components: detector
>    Affects Versions: 1.19.1
>            Reporter: Hans Brende
>            Priority: Critical
>
> When I try to run the CharsetDetector on 
> http://w3c.github.io/microdata-rdf/tests/0065.html I get the very strange 
> most confident result of "IBM500" with a confidence of 60 when I enable the 
> input filter, *even if I set the declared encoding to UTF-8*.
> This can be replicated with the following code:
> {code:java}
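> import java.nio.charset.StandardCharsets;
> import java.util.Arrays;
> import org.apache.tika.parser.txt.CharsetDetector;
>
> CharsetDetector detect = new CharsetDetector();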
> detect.enableInputFilter(true);
> detect.setDeclaredEncoding("UTF-8");
> detect.setText(("<!DOCTYPE html>\n" +
>         "<div>\n" +
>         "  <div itemscope itemtype=\"http://schema.org/Person\" id=\"amanda\" itemref=\"a b\"></div>\n" +
>         "  <p id=\"a\">Name: <span itemprop=\"name\">Amanda</span></p>\n" +
>         "  <p id=\"b\" itemprop=\"band\">Jazz Band</p>\n" +
>         "</div>").getBytes(StandardCharsets.UTF_8));
> Arrays.stream(detect.detectAll()).forEach(System.out::println);
> {code}
> which prints:
> {noformat}
> Match of IBM500 in fr with confidence 60
> Match of UTF-8 with confidence 57
> Match of ISO-8859-9 in tr with confidence 50
> Match of ISO-8859-1 in en with confidence 50
> Match of ISO-8859-2 in cs with confidence 12
> Match of Big5 in zh with confidence 10
> Match of EUC-KR in ko with confidence 10
> Match of EUC-JP in ja with confidence 10
> Match of GB18030 in zh with confidence 10
> Match of Shift_JIS in ja with confidence 10
> Match of UTF-16LE with confidence 10
> Match of UTF-16BE with confidence 10
> {noformat}
> Note that if I do not set the declared encoding to UTF-8, the result is even 
> worse, with UTF-8 falling from a confidence of 57 to 15. 
> This is screwing up 1 out of 84 of my online microdata extraction tests over 
> in Any23 (as that particular page is being rendered into complete gibberish), 
> so I had to implement some hacky workarounds which I'd like to remove if 
> possible.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
