[jira] [Commented] (TIKA-2771) enableInputFilter() wrecks charset detection for some short html documents

Hans Brende (JIRA) Fri, 02 Nov 2018 09:44:16 -0700


    [ 
https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16673400#comment-16673400
 ]


Hans Brende commented on TIKA-2771:
-----------------------------------

[[email protected]] I totally understand not wanting to modify ICU4J's code. 
But since you've *already* modified it by supporting EBCDIC charsets, that 
unfortunately is going to require additional modifications, since EBCDIC is not 
ASCII-compatible. E.g., in the {{MungeInput()}} method, an ASCII-compatible 
charset that maps 0x3C to "<" and 0x3E to ">" is *presupposed*. And then, the 
space character in IBM500 is not 0x20, but rather *0x40*. As a *bare minimum* 
set of modifications to the CharsetDetector class, I'd recommend the following: 

(1) if *any* tags are stripped from the input (using 0x3C and 0x3E), that 
should automatically set the confidence for all EBCDIC charsets to zero. 
(2) n-gram detection needs to use the proper space character (in this 
case, 0x40).
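
To make the two points above concrete, here is a minimal sketch, not Tika or ICU4J code. The class and method names ({{EbcdicGuards}}, {{spaceByte}}, {{hasAsciiAngleBrackets}}, {{adjustConfidence}}) are hypothetical, it only handles single-byte charsets, and it assumes the JVM ships the extended {{IBM500}} charset:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EbcdicGuards {

    /** Byte that encodes a space (U+0020) in a single-byte charset, as an unsigned value. */
    static int spaceByte(Charset cs) {
        return " ".getBytes(cs)[0] & 0xFF;
    }

    /** True if the charset encodes '<' and '>' as the ASCII bytes 0x3C and 0x3E. */
    static boolean hasAsciiAngleBrackets(Charset cs) {
        byte[] b = "<>".getBytes(cs);
        return b.length == 2 && (b[0] & 0xFF) == 0x3C && (b[1] & 0xFF) == 0x3E;
    }

    /**
     * Point (1): if tags were stripped by looking for 0x3C/0x3E, a charset that
     * does not use those bytes for '<' and '>' cannot be the right answer.
     */
    static int adjustConfidence(Charset cs, int rawConfidence, boolean tagsStripped) {
        if (tagsStripped && !hasAsciiAngleBrackets(cs)) {
            return 0;
        }
        return rawConfidence;
    }

    public static void main(String[] args) {
        // Assumption: IBM500 is available (it lives in the JDK's extended charsets).
        Charset ibm500 = Charset.forName("IBM500");
        Charset latin1 = StandardCharsets.ISO_8859_1;

        // Point (2): the word separator an n-gram parser should use per charset.
        System.out.printf("IBM500 space byte: 0x%02X%n", spaceByte(ibm500));
        System.out.printf("ISO-8859-1 space byte: 0x%02X%n", spaceByte(latin1));

        // Point (1): confidence after the input filter has stripped tags.
        System.out.println("IBM500 after tag stripping: " + adjustConfidence(ibm500, 60, true));
        System.out.println("ISO-8859-1 after tag stripping: " + adjustConfidence(latin1, 50, true));
    }
}
```

In the real CharsetDetector the "tags were stripped" flag would have to come out of {{MungeInput()}}, and the space byte would have to be threaded into the n-gram parser; how best to wire that in is left open here.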

I'd also highly recommend lowering the confidence of n-gram detection for 
shorter text. If the "declared encoding" is compatible with the entire input 
text, but an n-gram detector assigns a confidence of 60 to a different encoding 
based on accidental n-gram detection due to the shortness of the text, the 
declared encoding should take precedence (esp. if the declared encoding is 
UTF-8 and the accidental encoding is, for all practical purposes, used almost 
nowhere). This last issue might, as you say, be an issue for icu4j... however, 
one advantage to copying their code over is the very fact that you don't *have* 
to wait on them to improve your own code. Just a thought.
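
A rough sketch of what I mean by letting the declared encoding win on short input. Everything here is an assumption for illustration: the class name, the 512-byte cutoff, the linear scaling, and the threshold of 50 are made up and would need tuning against real data. ISO-8859-1 stands in for the accidental n-gram match so the example runs on any JVM:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class DeclaredEncodingPreference {

    // Assumed values, chosen only for illustration.
    static final int MIN_RELIABLE_LENGTH = 512;
    static final int OVERRIDE_THRESHOLD = 50;

    /** True if the whole input decodes under cs with no malformed or unmappable bytes. */
    static boolean decodesCleanly(byte[] input, Charset cs) {
        try {
            cs.newDecoder()
              .onMalformedInput(CodingErrorAction.REPORT)
              .onUnmappableCharacter(CodingErrorAction.REPORT)
              .decode(ByteBuffer.wrap(input));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    /** Scales an n-gram confidence down linearly for input shorter than the cutoff. */
    static int scaleForLength(int confidence, int length) {
        if (length >= MIN_RELIABLE_LENGTH) {
            return confidence;
        }
        return confidence * length / MIN_RELIABLE_LENGTH;
    }

    /**
     * Prefers the declared charset when it is compatible with the entire input
     * and the length-scaled n-gram confidence is below the override threshold.
     */
    static Charset choose(byte[] input, Charset declared, Charset ngramGuess, int ngramConfidence) {
        boolean declaredFits = declared != null && decodesCleanly(input, declared);
        int scaled = scaleForLength(ngramConfidence, input.length);
        return (declaredFits && scaled < OVERRIDE_THRESHOLD) ? declared : ngramGuess;
    }

    public static void main(String[] args) {
        byte[] input = "<p id=\"b\" itemprop=\"band\">Jazz Band</p>".getBytes(StandardCharsets.UTF_8);
        // Stand-in for an accidental n-gram match with confidence 60.
        Charset picked = choose(input, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1, 60);
        System.out.println("Chosen charset: " + picked);
    }
}
```

The key design choice is that the declared encoding is only trusted when it decodes the full input without errors, so a wrong declaration still loses to the detector.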

> enableInputFilter() wrecks charset detection for some short html documents
> --------------------------------------------------------------------------
>
>                 Key: TIKA-2771
>                 URL: https://issues.apache.org/jira/browse/TIKA-2771
>             Project: Tika
>          Issue Type: Bug
>          Components: detector
>    Affects Versions: 1.19.1
>            Reporter: Hans Brende
>            Priority: Critical
>
> When I try to run the CharsetDetector on 
> http://w3c.github.io/microdata-rdf/tests/0065.html I get the very strange 
> most confident result of "IBM500" with a confidence of 60 when I enable the 
> input filter, *even if I set the declared encoding to UTF-8*.
> This can be replicated with the following code:
> {code:java}
> import java.nio.charset.StandardCharsets;
> import java.util.Arrays;
> import org.apache.tika.parser.txt.CharsetDetector;
>
> CharsetDetector detect = new CharsetDetector();
> detect.enableInputFilter(true);
> detect.setDeclaredEncoding("UTF-8");
> detect.setText(("<!DOCTYPE html>\n" +
>         "<div>\n" +
>         "  <div itemscope itemtype=\"http://schema.org/Person\" id=\"amanda\" itemref=\"a b\"></div>\n" +
>         "  <p id=\"a\">Name: <span itemprop=\"name\">Amanda</span></p>\n" +
>         "  <p id=\"b\" itemprop=\"band\">Jazz Band</p>\n" +
>         "</div>").getBytes(StandardCharsets.UTF_8));
> Arrays.stream(detect.detectAll()).forEach(System.out::println);
> {code}
> which prints:
> {noformat}
> Match of IBM500 in fr with confidence 60
> Match of UTF-8 with confidence 57
> Match of ISO-8859-9 in tr with confidence 50
> Match of ISO-8859-1 in en with confidence 50
> Match of ISO-8859-2 in cs with confidence 12
> Match of Big5 in zh with confidence 10
> Match of EUC-KR in ko with confidence 10
> Match of EUC-JP in ja with confidence 10
> Match of GB18030 in zh with confidence 10
> Match of Shift_JIS in ja with confidence 10
> Match of UTF-16LE with confidence 10
> Match of UTF-16BE with confidence 10
> {noformat}
> Note that if I do not set the declared encoding to UTF-8, the result is even 
> worse, with UTF-8 falling from a confidence of 57 to 15. 
> This is screwing up 1 out of 84 of my online microdata extraction tests over 
> in Any23 (as that particular page is being rendered into complete gibberish), 
> so I had to implement some hacky workarounds which I'd like to remove if 
> possible.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
