[jira] [Commented] (TIKA-2771) enableInputFilter() wrecks charset detection for some short html documents

Tim Allison (JIRA) Mon, 05 Nov 2018 10:08:29 -0800


    [ 
https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16675511#comment-16675511
 ]


Tim Allison commented on TIKA-2771:
-----------------------------------

Let me try again.  I _think_ I've re-engaged my brain before I started typing 
this time.  Thank you for your patience.

bq. But since you've already modified it by supporting EBCDIC charsets...
+1

bq. (1) if any tags are stripped from the input (using 0x3C and 0x3E), that 
should automatically make the confidence for all EBCDIC charsets be zero. 

Yes. I agree with this, because the code currently fails to strip if there are 
too many {{badTags}}.
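
To make rule (1) concrete, here is a minimal sketch. It is not the current CharsetDetector code: the class name, the method names and the {{tagsStripped}} counter are hypothetical, and the list of EBCDIC names (the IBM420/IBM424 variants plus IBM500) is my assumption about which recognizers would be affected. The idea is only that finding tags delimited by 0x3C and 0x3E suggests ASCII-compatible input, so EBCDIC candidates should score zero.

```java
import java.util.Locale;

/** Hypothetical sketch of rule (1); not part of Tika or ICU4J. */
final class EbcdicTagRule {

    /** Assumed set of EBCDIC charset names: IBM420*, IBM424* and IBM500. */
    static boolean isEbcdic(String charsetName) {
        String name = charsetName.toUpperCase(Locale.ROOT);
        return name.startsWith("IBM420")
                || name.startsWith("IBM424")
                || name.equals("IBM500");
    }

    /**
     * Returns the confidence to report for a candidate charset.
     * tagsStripped is a hypothetical count of tags the input filter removed
     * using 0x3C and 0x3E as delimiters.
     */
    static int adjustConfidence(String charsetName, int rawConfidence, int tagsStripped) {
        if (tagsStripped > 0 && isEbcdic(charsetName)) {
            return 0; // tags were found with ASCII delimiters, so rule out EBCDIC
        }
        return rawConfidence;
    }

    public static void main(String[] args) {
        System.out.println(adjustConfidence("IBM500", 60, 4));
        System.out.println(adjustConfidence("UTF-8", 57, 4));
    }
}
```

With a rule like this, the IBM500 match from the report would drop out whenever the filter had actually stripped something, while non-EBCDIC candidates keep their raw scores.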

bq. (2) n-gram detection needs to happen using the proper space character (in 
this case, 0x40)
I agree with your point, but to confirm I understand our code: I _think_ we do 
this mapping in EBCDIC's {{byteMap}} 
([here|https://github.com/apache/tika/blob/master/tika-parsers/src/main/java/org/apache/tika/parser/txt/CharsetRecog_sbcs.java#L1229]).
We do map 0x40 (and a bunch of other stuff) to 0x20.
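
As a rough illustration of what that mapping does, here is a simplified sketch. It is not the real table: the class and method names are hypothetical, and I am assuming an identity map with only the 0x40 entry changed, whereas the actual {{byteMap}} remaps many more bytes.

```java
/** Hypothetical, simplified stand-in for a recognizer's byte map. */
final class SpaceMapSketch {

    /** Builds an identity map in which only the EBCDIC space 0x40 becomes 0x20. */
    static byte[] buildMap() {
        byte[] map = new byte[256];
        for (int i = 0; i < 256; i++) {
            map[i] = (byte) i;
        }
        map[0x40] = 0x20; // EBCDIC space -> the space value the n-gram code expects
        return map;
    }

    /** Applies the map to each input byte before any n-gram counting. */
    static byte[] normalize(byte[] input, byte[] map) {
        byte[] out = new byte[input.length];
        for (int i = 0; i < input.length; i++) {
            out[i] = map[input[i] & 0xFF]; // mask to get an unsigned index
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] raw = {(byte) 0xC1, 0x40, (byte) 0xC2};
        byte[] mapped = normalize(raw, buildMap());
        System.out.println(mapped[1] == 0x20);
    }
}
```

If the real table works along these lines, the n-gram code never sees 0x40 directly; it sees 0x20 wherever the EBCDIC input had a space.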

bq. (For my last thought, I'd recommend taking a look at the Wilson Score 
interval found here: 
https://en.wikipedia.org/wiki/Binomial_proportion_confidence_interval )

I agree that measuring confidence makes a great deal of sense, and perhaps 
Wilson is the right way to go.  However, I'd want to re-think how the stats 
were compiled, how the score is computed and whether there is an improvement as 
part of adding a confidence measurement.  The CharsetDetector, as it stands, 
has quite a bit of hackery in it, and I'd be concerned that adding a confidence 
interval on top of a somewhat, um, heuristic, score might give the wrong 
impression.  In short, I agree, but I'd want to do a bunch more work, 
including, potentially, redoing how the scores are calculated.
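
For reference, a minimal sketch of the lower bound of the Wilson score interval, following the formula on the linked Wikipedia page. Treating "n-gram hits out of n-grams tested" as the binomial proportion is purely my assumption for illustration; the class and parameter names are hypothetical, and nothing like this exists in the detector today.

```java
/** Hypothetical sketch of the Wilson score lower bound; not Tika code. */
final class WilsonSketch {

    /**
     * Lower bound of the Wilson score interval for hits successes in trials
     * attempts, with z as the normal quantile (1.96 for roughly 95%).
     */
    static double lowerBound(long hits, long trials, double z) {
        if (trials <= 0) {
            return 0.0; // no evidence at all, so claim nothing
        }
        double n = trials;
        double p = hits / n;
        double z2 = z * z;
        double denom = 1.0 + z2 / n;
        double center = (p + z2 / (2.0 * n)) / denom;
        double margin = z * Math.sqrt(p * (1.0 - p) / n + z2 / (4.0 * n * n)) / denom;
        return Math.max(0.0, center - margin);
    }

    public static void main(String[] args) {
        // Same hit rate, different amounts of evidence.
        System.out.println(lowerBound(5, 10, 1.96));
        System.out.println(lowerBound(50, 100, 1.96));
    }
}
```

The appeal for short documents is that the same hit rate yields a lower bound when there are fewer trials, but that only helps if the underlying hit counts are meaningful, which is the part I'd want to revisit first.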

bq. "We no longer actively developing the charset detector function."
Yikes.  Thank you for pointing that out!


> enableInputFilter() wrecks charset detection for some short html documents
> --------------------------------------------------------------------------
>
>                 Key: TIKA-2771
>                 URL: https://issues.apache.org/jira/browse/TIKA-2771
>             Project: Tika
>          Issue Type: Bug
>          Components: detector
>    Affects Versions: 1.19.1
>            Reporter: Hans Brende
>            Priority: Critical
>
> When I try to run the CharsetDetector on 
> http://w3c.github.io/microdata-rdf/tests/0065.html I get the very strange 
> most confident result of "IBM500" with a confidence of 60 when I enable the 
> input filter, *even if I set the declared encoding to UTF-8*.
> This can be replicated with the following code:
> {code:java}
> CharsetDetector detect = new CharsetDetector();
> detect.enableInputFilter(true);
> detect.setDeclaredEncoding("UTF-8");
> detect.setText(("<!DOCTYPE html>\n" +
>         "<div>\n" +
>         "  <div itemscope itemtype=\"http://schema.org/Person\" id=\"amanda\" itemref=\"a b\"></div>\n" +
>         "  <p id=\"a\">Name: <span itemprop=\"name\">Amanda</span></p>\n" +
>         "  <p id=\"b\" itemprop=\"band\">Jazz Band</p>\n" +
>         "</div>").getBytes(StandardCharsets.UTF_8));
> Arrays.stream(detect.detectAll()).forEach(System.out::println);
> {code}
> which prints:
> {noformat}
> Match of IBM500 in fr with confidence 60
> Match of UTF-8 with confidence 57
> Match of ISO-8859-9 in tr with confidence 50
> Match of ISO-8859-1 in en with confidence 50
> Match of ISO-8859-2 in cs with confidence 12
> Match of Big5 in zh with confidence 10
> Match of EUC-KR in ko with confidence 10
> Match of EUC-JP in ja with confidence 10
> Match of GB18030 in zh with confidence 10
> Match of Shift_JIS in ja with confidence 10
> Match of UTF-16LE with confidence 10
> Match of UTF-16BE with confidence 10
> {noformat}
> Note that if I do not set the declared encoding to UTF-8, the result is even 
> worse, with UTF-8 falling from a confidence of 57 to 15. 
> This is screwing up 1 out of 84 of my online microdata extraction tests over 
> in Any23 (as that particular page is being rendered into complete gibberish), 
> so I had to implement some hacky workarounds which I'd like to remove if 
> possible.
> EDIT: This issue may be related to TIKA-2737 and [this 
> comment|https://issues.apache.org/jira/browse/TIKA-539?focusedCommentId=13213524&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13213524].



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
