[jira] [Commented] (TIKA-2771) enableInputFilter() wrecks charset detection for some short html documents
[ https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16680557#comment-16680557 ]

Hans Brende commented on TIKA-2771:
-----------------------------------

[~talli...@apache.org] Great! I will definitely check that out. Detecting common words is a great idea!

One concern I'd have with oversampling non-UTF-8 pages is that it may train encoding detectors to be more confident in non-UTF-8 encodings than they should be. An encoding detector optimized for a scenario in which the various encodings have "equal representation", so to speak, may actually have much poorer accuracy on the real-life charset distribution (given that > 92% of the web is UTF-8). But I'm curious to hear your rationale as well. (I suppose this concern could be mitigated by assigning greater weight to UTF-8 detection successes & failures than to those of other charsets, in proportion to their actual occurrence on the web.)

> enableInputFilter() wrecks charset detection for some short html documents
> --------------------------------------------------------------------------
>
>                 Key: TIKA-2771
>                 URL: https://issues.apache.org/jira/browse/TIKA-2771
>             Project: Tika
>          Issue Type: Bug
>          Components: detector
>    Affects Versions: 1.19.1
>            Reporter: Hans Brende
>            Priority: Critical
>
> When I try to run the CharsetDetector on
> http://w3c.github.io/microdata-rdf/tests/0065.html I get the very strange
> most confident result of "IBM500" with a confidence of 60 when I enable the
> input filter, *even if I set the declared encoding to UTF-8*.
> This can be replicated with the following code:
> {code:java}
> CharsetDetector detect = new CharsetDetector();
> detect.enableInputFilter(true);
> detect.setDeclaredEncoding("UTF-8");
> detect.setText(("<!DOCTYPE html>\n" +
>         "<html>\n" +
>         " <div itemscope itemtype=\"http://schema.org/Person\" id=\"amanda\" itemref=\"a b\">\n" +
>         " <p id=\"a\">Name: <span itemprop=\"name\">Amanda</span></p>\n" +
>         " <div id=\"b\" itemprop=\"band\">Jazz Band</div>\n" +
>         "</html>").getBytes(StandardCharsets.UTF_8));
> Arrays.stream(detect.detectAll()).forEach(System.out::println);
> {code}
> which prints:
> {noformat}
> Match of IBM500 in fr with confidence 60
> Match of UTF-8 with confidence 57
> Match of ISO-8859-9 in tr with confidence 50
> Match of ISO-8859-1 in en with confidence 50
> Match of ISO-8859-2 in cs with confidence 12
> Match of Big5 in zh with confidence 10
> Match of EUC-KR in ko with confidence 10
> Match of EUC-JP in ja with confidence 10
> Match of GB18030 in zh with confidence 10
> Match of Shift_JIS in ja with confidence 10
> Match of UTF-16LE with confidence 10
> Match of UTF-16BE with confidence 10
> {noformat}
> Note that if I do not set the declared encoding to UTF-8, the result is even
> worse, with UTF-8 falling from a confidence of 57 to 15.
> This is screwing up 1 out of 84 of my online microdata extraction tests over
> in Any23 (as that particular page is being rendered into complete gibberish),
> so I had to implement some hacky workarounds which I'd like to remove if
> possible.
> EDIT: This issue may be related to TIKA-2737 and [this
> comment|https://issues.apache.org/jira/browse/TIKA-539?focusedCommentId=13213524&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13213524].

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[ https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16680404#comment-16680404 ]

Nick Burch commented on TIKA-2771:
----------------------------------

I'm not sure we do. We have documents along with the encoding that their web server claimed them to have, but I'm not sure I'd trust those server-reported encodings very far.

It's possible that you could find some useful files in the ICU4J test suite, though those would probably need wrapping in html tags, as I think they're mostly (all?) plain text.

Otherwise, it might be worth finding some "complicated" and "typical" files from a range of non-English languages, in known encodings, then using iconv (or similar) to convert them to other encodings to test with.
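As an alternative to shelling out to iconv, the same fixture generation can be sketched in Java itself. The sample text and target encodings below are arbitrary, illustrative choices:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class TranscodeFixtures {
    // Re-encode known ground-truth text into a target charset. Note that
    // getBytes() silently replaces unmappable characters; for real test
    // fixtures a CharsetEncoder that reports them may be preferable.
    static byte[] transcode(String groundTruth, Charset target) {
        return groundTruth.getBytes(target);
    }

    public static void main(String[] args) {
        String sample = "Grüße aus München"; // arbitrary sample text
        byte[] latin1 = transcode(sample, StandardCharsets.ISO_8859_1);
        byte[] utf8 = transcode(sample, StandardCharsets.UTF_8);
        // ü and ß are 1 byte each in ISO-8859-1 but 2 bytes in UTF-8,
        // so the two fixtures differ in length.
        System.out.println(latin1.length + " bytes vs " + utf8.length + " bytes");
    }
}
```

Each output file then has a known ground-truth encoding, which is exactly what a detector benchmark needs.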
[ https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16680452#comment-16680452 ]

Tim Allison commented on TIKA-2771:
-----------------------------------

[~HansBrende], funny you mention that... as [~gagravarr] pointed out, we now have header info, but that is, um, interesting. I recently refreshed our large-scale regression corpus: TIKA-2750. See my wiki entry/blog post that describes how I did that: https://wiki.apache.org/tika/CommonCrawl3 . You can download the sampling frames I used here: http://162.242.228.174/share/commoncrawl3/sampling_frames.zip

So, while it would be great if we had ground truth, I had planned to use tika-eval's out-of-vocabulary (OOV) metrics (https://wiki.apache.org/tika/TikaEval) to compare the text as extracted with different detectors... the idea is that encoding detectors are not likely to "hallucinate" common words if they're wrong.
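The OOV idea is easy to sketch: decode the same bytes with two candidate charsets and see which decoding produces recognizable words. The four-word vocabulary below is a toy stand-in for tika-eval's real per-language word lists, not its actual implementation:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Set;

public class OovSketch {
    // Toy vocabulary; tika-eval uses much larger per-language word lists.
    static final Set<String> VOCAB = Set.of("name", "amanda", "jazz", "band");

    // Fraction of tokens not found in the vocabulary (1.0 if no tokens).
    static double oov(String text) {
        String[] tokens = text.toLowerCase().split("[^\\p{L}]+");
        long total = Arrays.stream(tokens).filter(t -> !t.isEmpty()).count();
        long miss = Arrays.stream(tokens)
                .filter(t -> !t.isEmpty() && !VOCAB.contains(t)).count();
        return total == 0 ? 1.0 : (double) miss / total;
    }

    public static void main(String[] args) {
        byte[] bytes = "Name: Amanda Jazz Band".getBytes(StandardCharsets.UTF_8);
        // Decoding with the right charset hits the vocabulary; decoding the
        // same bytes as IBM500 yields gibberish and a near-total OOV rate.
        System.out.println("UTF-8 OOV:  " + oov(new String(bytes, StandardCharsets.UTF_8)));
        System.out.println("IBM500 OOV: " + oov(new String(bytes, Charset.forName("IBM500"))));
    }
}
```

A wrong decoding can't "hallucinate" common words, so the lower-OOV decoding is almost always the right one.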
[ https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16680378#comment-16680378 ]

Hans Brende commented on TIKA-2771:
-----------------------------------

[~talli...@apache.org] Does Tika have a corpus of documents paired with expected encodings that I could use to test the PR I mentioned? I'm very curious to find out how much of an improvement it is.
[ https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16677340#comment-16677340 ]

Hans Brende commented on TIKA-2771:
-----------------------------------

[~talli...@apache.org] I've implemented my ideas for charset detection improvement in [this pull request|https://github.com/apache/any23/pull/131] over in Any23. If you have time to look it over, I'd appreciate any feedback you can give. Some of it might be useful for Tika as well (I've taken a few open issues here into account).
[ https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16676926#comment-16676926 ]

Hans Brende commented on TIKA-2771:
-----------------------------------

One thing I am sure of, however, is that if your chances of getting a false positive for a given charset are *greater* than your chances of actually finding that charset "in the wild", then it is counterproductive to try to detect it in the first place. That goes not just for IBM500, but for anything that isn't UTF-8.

Given that > 90% of the web is UTF-8 (and the web, correct me if I'm wrong, seems to be the primary use-case for charset detection), a charset detector whose strategy is simply:
{code:java}
return "UTF-8";
{code}
is going to be at least 90% accurate. Source: https://w3techs.com/technologies/overview/character_encoding/all

So detection of any charsets *other than* UTF-8 needs to increase the accuracy to something *greater than* 90%; otherwise the false positives will actually *decrease* the overall accuracy!

(I bring this up because I noticed in a different issue thread (TIKA-2038) that it was mentioned that Tika [is only 72% accurate|https://issues.apache.org/jira/browse/TIKA-2038?focusedCommentId=15830525&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-15830525]. Am I missing something here? Would we really do better on webpages by simply guessing *everything* to be UTF-8?)
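That break-even point can be made concrete with a little arithmetic. The recall figures below are hypothetical, chosen only to show how quickly false positives on UTF-8 pages eat into overall accuracy:

```java
// Back-of-the-envelope illustration of the break-even argument: with ~92%
// of the web in UTF-8, a detector that always answers "UTF-8" is ~92%
// accurate, and a smarter detector only wins if its gains on non-UTF-8
// pages outweigh the UTF-8 pages it misclassifies. Recall numbers are
// hypothetical.
public class BreakEven {
    static double overallAccuracy(double utf8Share, double recallUtf8, double recallOther) {
        return utf8Share * recallUtf8 + (1 - utf8Share) * recallOther;
    }

    public static void main(String[] args) {
        double baseline = overallAccuracy(0.92, 1.0, 0.0); // always guess UTF-8
        // A detector with few false positives on UTF-8 pages barely edges
        // out the trivial baseline...
        double okDetector = overallAccuracy(0.92, 0.95, 0.60);
        // ...while one that misfires more often on UTF-8 falls below it.
        double badDetector = overallAccuracy(0.92, 0.85, 0.60);
        System.out.printf("baseline=%.3f ok=%.3f bad=%.3f%n",
                baseline, okDetector, badDetector);
    }
}
```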
[ https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16676887#comment-16676887 ]

Hans Brende commented on TIKA-2771:
-----------------------------------

Compare to the following analogous test for ISO-8859-1 variants:

||Issue||ISO-8859-X||byteMap'ed||2nd best||p||n||Wilson L.B.||IBM500 L.B.||
|TIKA-771|"Hello, World!"|"hello world "|ISO-8859-1(it)|23%|10|*p' = 7%*|7%|
|TIKA-868|"Indanyl"|"indanyl"|ISO-8859-9(tr)|37%|7|*p' = 12%*|13%|
|TIKA-2771|"Name: Amanda\nJazz Band"|"name amanda jazz band"|ISO-8859-1(en)|54%|18|*p' = 32%*|24%|

To calculate the Wilson lower bound, I used a confidence level of 95% (i.e., z = 1.96).

I'm not saying that the Wilson lower bound is *the* way to go (as you can see, it wasn't quite enough to fix TIKA-868, although it did reduce the discrepancy from 60% - 37% = *23%* to 13% - 12% = *1%*), so this method might need some adjustments. However, it does seem to represent a significant improvement over the way things are *now*.

*A simpler alternative would be to discard any charset that gets less alphabetic text out of the input than another charset does. This method would succeed on all 3 of the test cases presented here.*
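The alphabetic-yield heuristic proposed in the comment above can be sketched in a few lines (an illustration of the idea, not Tika's detector code):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class AlphaYield {
    // Number of letters obtained by decoding the input with a candidate charset.
    static long alphaCount(byte[] data, Charset cs) {
        return new String(data, cs).codePoints().filter(Character::isLetter).count();
    }

    public static void main(String[] args) {
        byte[] input = "Name: Amanda\nJazz Band".getBytes(StandardCharsets.UTF_8);
        // IBM500 turns most of the ASCII letters into punctuation and control
        // characters, so its alphabetic yield collapses relative to UTF-8's.
        long utf8 = alphaCount(input, StandardCharsets.UTF_8);      // 18 letters
        long ibm500 = alphaCount(input, Charset.forName("IBM500")); // far fewer
        System.out.println("UTF-8: " + utf8 + ", IBM500: " + ibm500);
    }
}
```

A detector could then simply discard any candidate whose yield is dominated by another candidate's before comparing confidences.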
[ https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16676109#comment-16676109 ]

Hans Brende commented on TIKA-2771:
-----------------------------------

[~talli...@apache.org] I did a little experimentation with each of the input texts that were causing trouble for TIKA-771, TIKA-868, and this issue. Here are the results:

||Issue||Input Text||IBM500||byteMap'ed||best||p||n||Wilson L.B.||
|TIKA-771|"Hello, World!"|"çÁ%%? ï?Ê%À "|"çá ï ê à "|IBM500(fr)|30%|5|*p' = 7%*|
|TIKA-868|"Indanyl"|"ñ>À/>`%"|"ñ à"|IBM500(fr)|60%|2|*p' = 13%*|
|TIKA-2771|"Name: Amanda\nJazz Band"|"+/\_Á \_/>À/ [/:: â/>À"|" á à â à"|IBM500(fr)|66%|4|*p' = 24%*|

One thing is evident from this test: it's not the mapping of control & punctuation chars to 0x20 that's the problem (the byteMap for ISO-8859-1 *also* strips control & punctuation chars by mapping them to whitespace). Rather, the problem is that under IBM500 much of the text is *likely* to be mapped to punctuation & control chars, *but the confidence is not reduced when the amount of actual alphabetic text being tested shrinks to near zero.*

The lower bound of the Wilson score confidence interval, however, seems to give a much better estimate of our actual confidence, based on the number of characters we actually end up testing. (And while the initial "confidence" value is to some extent arbitrary, the point of the Wilson lower bound is not the final number we get out, but that we are reducing confidences *relative* to those of other charsets that succeeded in getting more alphabetic text out of the input, and *relative* to those of any declared charsets.)
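For reference, the lower bound used in the tables above is the standard Wilson score interval bound; the sketch below (not ICU4J or Any23 code) reproduces the p' column:

```java
// Wilson score lower bound: given an observed confidence p over n effective
// samples, shrink it toward zero as n gets small. Standard formula with
// z = 1.96 (95% confidence level).
public class WilsonBound {
    static double lowerBound(double p, int n, double z) {
        double z2 = z * z;
        return (p + z2 / (2 * n)
                - z * Math.sqrt(p * (1 - p) / n + z2 / (4.0 * n * n)))
                / (1 + z2 / n);
    }

    public static void main(String[] args) {
        // Reproduces the p' column from the tables above:
        System.out.printf("TIKA-771:  %.0f%%%n", 100 * lowerBound(0.30, 5, 1.96));  // ~7%
        System.out.printf("TIKA-868:  %.0f%%%n", 100 * lowerBound(0.60, 2, 1.96));  // ~13%
        System.out.printf("TIKA-2771: %.0f%%%n", 100 * lowerBound(0.66, 4, 1.96));  // ~24%
    }
}
```

The key property is that a 60% raw confidence backed by only 2 effective samples is pulled down far harder than a 30% confidence backed by 18.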
[ https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16675880#comment-16675880 ]

Hans Brende commented on TIKA-2771:
-----------------------------------

[~wave] Yep, just ran the following:
{code:java}
IntStream.range(0, 256).forEach(i -> System.out.println(
        Integer.toHexString(i) + ": " +
        new String(new byte[]{(byte) i}, Charset.forName("IBM500")).codePoints()
                .mapToObj(Integer::toHexString)
                .collect(Collectors.joining(" "))));
{code}
and it does look like IBM500 has a 1-to-1 mapping to latin-1. That should simplify the fix.
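The 1-to-1 observation can be checked exhaustively against the JDK's own IBM500 decoder rather than by eyeballing the printed dump:

```java
import java.nio.charset.Charset;
import java.util.HashSet;
import java.util.Set;

public class Ibm500Bijection {
    // True if every one of the 256 byte values decodes to a distinct
    // character in the Latin-1 range U+0000..U+00FF, i.e. the charset is a
    // permutation of ISO-8859-1's repertoire.
    static boolean isLatin1Permutation(Charset cs) {
        Set<Integer> seen = new HashSet<>();
        for (int i = 0; i < 256; i++) {
            int cp = new String(new byte[]{(byte) i}, cs).codePointAt(0);
            if (cp > 0xFF || !seen.add(cp)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println("IBM500 is a Latin-1 permutation: "
                + isLatin1Permutation(Charset.forName("IBM500")));
    }
}
```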
[ https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16675839#comment-16675839 ]

Tim Allison commented on TIKA-2771:
-----------------------------------

I was thinking something similar...
[ https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16675828#comment-16675828 ] Hans Brende commented on TIKA-2771: --- [~talli...@apache.org] Ah, you're correct as regards the byteMap. The TODO comment threw me. However, on closer inspection of the IBM500 byteMap, I see an even more alarming issue: 118 out of the 256 bytes map to 0x20!!! But only 0x40 should map to 0x20. This could explain why so many false positives for IBM500 are occurring: *all special characters are mapped to spaces, and then simply ignored*. But in order to have accurate n-gram measurements, those special characters need to be included in the calculations, I believe. I'm not sure, but perhaps they should be mapped to 0x00 instead of 0x20?
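A toy sketch of the byteMap concern above (my own illustration, not Tika's or ICU4J's actual recognizer code): folding every unmatched byte to 0x20 manufactures word boundaries, so the space-adjacent trigrams that dominate an sbcs recognizer's n-gram table start matching even where the input had no real spaces, while folding to 0x00 keeps those bytes from matching anything.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashSet;
import java.util.Set;

public class FoldDemo {
    // Fold: keep ASCII lowercase letters, map every other byte to `filler`.
    static byte[] fold(byte[] in, byte filler) {
        byte[] out = new byte[in.length];
        for (int i = 0; i < in.length; i++) {
            byte b = in[i];
            out[i] = (b >= 'a' && b <= 'z') ? b : filler;
        }
        return out;
    }

    // Count trigrams of the folded stream that appear in a reference set
    // (here: trigrams of common French words padded with spaces).
    static int hits(byte[] folded, Set<String> ngrams) {
        int n = 0;
        for (int i = 0; i + 3 <= folded.length; i++) {
            if (ngrams.contains(new String(folded, i, 3, StandardCharsets.ISO_8859_1))) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        Set<String> ngrams = new HashSet<>();
        for (String g : new String[]{" le", "le ", " de", "de ", " et", "et "}) {
            ngrams.add(g);
        }
        // Punctuation-heavy input with no real spaces at all.
        byte[] text = "x!le#de%et&".getBytes(StandardCharsets.ISO_8859_1);
        int withSpace = hits(fold(text, (byte) ' '), ngrams); // punctuation -> fake word boundaries
        int withNul   = hits(fold(text, (byte) 0), ngrams);   // punctuation stays non-matching
        System.out.println(withSpace + " vs " + withNul);
    }
}
```

On this sample input the space filler yields 6 trigram hits versus 0 for the NUL filler, which is the inflation effect described in the comment above.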
[ https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16675740#comment-16675740 ] Tim Allison commented on TIKA-2771: --- Got it. Thank you. bq. which calls: match(det, ngrams, byteMap, (byte) 0x20); I agree, but doesn't the byteMap that is passed in map 0x40 to 0x20 during the actual match calculations?
[ https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16675708#comment-16675708 ] Hans Brende commented on TIKA-2771: --- [~talli...@apache.org] I'm not sure which all of the charsets are ASCII-incompatible; I just implemented the check as follows: {code:java} if (tagsWereStripped && !"< />".equals(new String("< />".getBytes(), charset))) { confidence = 0; } {code} Re: space char: {{public CharsetMatch match(CharsetDetector det)}} (CharsetRecog_sbcs.java: 1279) calls {{int match(CharsetDetector det, int[] ngrams, byte[] byteMap)}} (CharsetRecog_sbcs.java: 40) which calls: {{match(det, ngrams, byteMap, (byte) 0x20);}}
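The round-trip idea behind that check can be sketched in isolation (a standalone version of mine, using an explicit US-ASCII encode rather than the platform default charset): a charset is safe for 0x3C/0x3E tag stripping only if it decodes those ASCII bytes back to "< />".

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class AsciiCompatCheck {
    // True if the charset decodes the ASCII bytes for "< />" back to the
    // same string, i.e. stripping tags on raw 0x3C/0x3E bytes was legitimate.
    static boolean asciiCompatible(Charset cs) {
        String probe = "< />";
        return probe.equals(new String(probe.getBytes(StandardCharsets.US_ASCII), cs));
    }

    public static void main(String[] args) {
        System.out.println(asciiCompatible(StandardCharsets.ISO_8859_1)); // true
        System.out.println(asciiCompatible(Charset.forName("IBM500")));   // false: EBCDIC controls
    }
}
```

Any ASCII-compatible single-byte charset passes this probe; IBM500 fails because bytes 0x3C and 0x3E land in the EBCDIC control range rather than on '<' and '>'.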
[ https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16675527#comment-16675527 ] Tim Allison commented on TIKA-2771: --- I'm happy enough adding this check into EBCDIC500. Are there any other charsets for which we should do the same (e.g. where 3C and 3E do not == '<' and '>')? At some point, I started work on a stripper that took comments, script, style, etc. into consideration. I think we might want to look into that again. We may be able to use the {{PreScanner}} or some of the other code that arrived recently in the {{o.a.t.parsers.html.charsetdetector}} package.
[ https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16675520#comment-16675520 ] Tim Allison commented on TIKA-2771: --- When I add a {{tagsWereStripped}} flag and have the EBCDIC500 charsets return null when it is {{true}}, I get these results with declared encoding "UTF-8" and the input filter enabled: {noformat} Match of UTF-8 with confidence 57 Match of ISO-8859-9 in tr with confidence 50 Match of ISO-8859-1 in en with confidence 50 Match of ISO-8859-2 in cs with confidence 12 Match of Big5 in zh with confidence 10 ... {noformat} With declared encoding "UTF-8" and the filter disabled: {noformat} Match of UTF-8 with confidence 57 Match of ISO-8859-1 in en with confidence 31 Match of ISO-8859-9 in tr with confidence 19 Match of ISO-8859-2 in ro with confidence 15 ... {noformat} With no declared encoding and the filter enabled: {noformat} Match of ISO-8859-9 in tr with confidence 50 Match of ISO-8859-1 in en with confidence 50 Match of UTF-8 with confidence 15 Match of ISO-8859-2 in cs with confidence 12 ... {noformat} With no declared encoding and the filter disabled: {noformat} Match of ISO-8859-1 in en with confidence 31 Match of ISO-8859-9 in tr with confidence 19 Match of ISO-8859-2 in ro with confidence 15 Match of UTF-8 with confidence 15 ... {noformat}
[ https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16675511#comment-16675511 ] Tim Allison commented on TIKA-2771: --- Let me try again. I _think_ I've re-engaged my brain before I started typing this time. Thank you for your patience. bq. But since you've already modified it by supporting EBCDIC charsets... +1 bq. (1) if any tags are stripped from the input (using 0x3C and 0x3E), that should automatically make the confidence for all EBCDIC charsets be zero. Y. I agree with this because the code currently fails to strip if there are too many {{badTags}}. bq. (2) n-gram detection needs to happen using the proper space character (in this case, 0x40) I agree with your point, but to confirm I understand our code, I _think_ we do this mapping in EBCDIC's {{byteMap}} ([here|https://github.com/apache/tika/blob/master/tika-parsers/src/main/java/org/apache/tika/parser/txt/CharsetRecog_sbcs.java#L1229]). We do map 0x40 (and a bunch of other stuff) to 0x20. bq. (For my last thought, I'd recommend taking a look at the Wilson Score interval found here: https://en.wikipedia.org/wiki/Binomial_proportion_confidence_interval ) I agree that measuring confidence makes a great deal of sense, and perhaps Wilson is the right way to go. However, I'd want to re-think how the stats were compiled, how the score is computed and whether there is an improvement as part of adding a confidence measurement. The CharsetDetector, as it stands, has quite a bit of hackery in it, and I'd be concerned that adding a confidence interval on top of a somewhat, um, heuristic, score might give the wrong impression. In short, I agree, but I'd want to do a bunch more work, including, potentially, redoing how the scores are calculated. bq. "We no longer actively developing the charset detector function." Yikes. Thank you for pointing that out! 
[ https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16673481#comment-16673481 ] Hans Brende commented on TIKA-2771: --- (Also relating to my last thought, on the subject of "waiting for icu4j", I see: "We no longer actively developing the charset detector function." from https://unicode-org.atlassian.net/browse/ICU-13465 )
[ https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16673429#comment-16673429 ] Hans Brende commented on TIKA-2771: --- (For my last thought, I'd recommend taking a look at this: https://en.wikipedia.org/wiki/Binomial_proportion_confidence_interval )
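For reference, the Wilson score lower bound from that article looks like this as a standalone sketch (class and method names are mine, not part of any Tika API):

```java
public class Wilson {
    // Wilson score lower bound for a binomial proportion p = successes/n:
    // (p + z^2/2n - z*sqrt(p(1-p)/n + z^2/4n^2)) / (1 + z^2/n)
    static double lowerBound(int successes, int n, double z) {
        if (n == 0) {
            return 0.0; // no evidence, no confidence
        }
        double p = (double) successes / n;
        double z2 = z * z;
        double center = p + z2 / (2.0 * n);
        double margin = z * Math.sqrt(p * (1 - p) / n + z2 / (4.0 * n * n));
        return (center - margin) / (1 + z2 / n);
    }

    public static void main(String[] args) {
        // Same observed proportion (90%), but more evidence yields a
        // higher lower bound -- the property that would penalize
        // n-gram "confidence" computed from very short inputs.
        System.out.printf("%.3f%n", lowerBound(9, 10, 1.96));   // ~0.596
        System.out.printf("%.3f%n", lowerBound(90, 100, 1.96)); // ~0.826
    }
}
```

The point for short-text detection: a 90% n-gram hit rate over ten trigrams only supports a confidence floor near 0.6, whereas the same rate over a hundred trigrams supports one above 0.8.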
[ https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16673400#comment-16673400 ] Hans Brende commented on TIKA-2771: --- [~talli...@apache.org] I totally understand not wanting to modify ICU4J's code. But since you've *already* modified it by supporting EBCDIC charsets, that unfortunately is going to require additional modifications, since EBCDIC is not ASCII-compatible. E.g., in the {{MungeInput()}} method, an ASCII-compatible charset that maps 0x3C to "<" and 0x3E to ">" is *presupposed*. And then, the space character in IBM500 is not 0x20, but rather *0x40*. As a *bare minimum* set of modifications to the CharsetDetector class, I'd recommend the following: (1) if *any* tags are stripped from the input (using 0x3C and 0x3E), that should automatically make the confidence for all EBCDIC charsets be zero; (2) n-gram detection needs to happen using the proper space character (in this case, 0x40). I'd also highly recommend lowering the confidence of n-gram detection for shorter text. If the "declared encoding" is compatible with the entire input text, but an n-gram detector assigns a confidence of 60 to a different encoding based on accidental n-gram detection due to the shortness of the text, the declared encoding should take precedence (esp. if the declared encoding is UTF-8 and the accidental encoding is, for all practical purposes, used almost nowhere). This last issue might, as you say, be an issue for icu4j... however, one advantage to copying their code over is the very fact that you don't *have* to wait on them to improve your own code. Just a thought.
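The ASCII/EBCDIC mismatch described above is easy to confirm against the JDK's own IBM500 charset:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EbcdicBytes {
    public static void main(String[] args) {
        Charset ibm500 = Charset.forName("IBM500");
        byte[] ascii = "< >".getBytes(StandardCharsets.US_ASCII);
        byte[] ebcdic = "< >".getBytes(ibm500);
        // ASCII:  '<' = 0x3C, ' ' = 0x20, '>' = 0x3E
        // IBM500: '<' = 0x4C, ' ' = 0x40, '>' = 0x6E
        for (byte b : ascii) {
            System.out.printf("%02X ", b);
        }
        System.out.println();
        for (byte b : ebcdic) {
            System.out.printf("%02X ", b);
        }
        System.out.println();
    }
}
```

So a tag stripper scanning raw bytes for 0x3C/0x3E, and an n-gram matcher treating 0x20 as the word separator, are both operating on the wrong byte values for any EBCDIC candidate.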
[ https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16673220#comment-16673220 ] Tim Allison commented on TIKA-2771: --- let me re-engage brain before typing again...sorry.
[ https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16673215#comment-16673215 ] Hans Brende commented on TIKA-2771: --- [~talli...@apache.org] IBM500 (a.k.a. EBCDIC 500) is an EBCDIC charset. So, I'm confused: are you saying we should ask icu4j to support EBCDIC charsets, and then recopy their code?
worse, with UTF-8 falling from a confidence of 57 to 15. > This is screwing up 1 out of 84 of my online microdata extraction tests over > in Any23 (as that particular page is being rendered into complete gibberish), > so I had to implement some hacky workarounds which I'd like to remove if > possible. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (TIKA-2771) enableInputFilter() wrecks charset detection for some short html documents
[ https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16673188#comment-16673188 ]

Tim Allison commented on TIKA-2771:
---
[~HansBrende], thank you for raising this issue and sharing this with us. Let's figure out how to fix this.

Charset detection on short strings is always problematic. The CharsetDetector is a copy/paste from icu4j. I _think_ the only difference is that we've added EBCDIC charsets that icu4j didn't want to support.

While I agree with you on the above, I'd much prefer to get the changes into ICU4j than to modify our fork and then try to maintain that delta when we next copy/paste from ICU4j. If you still think there's a need to make modifications to our preprocessing, I'd be open to that, but the actual algorithmic changes should be made upstream, IMHO.

We do have a charset-override option which will allow you to say "treat this as (e.g.) UTF-8" no matter what detection says. Set whatever encoding you want in the Metadata object with this key: {{TikaCoreProperties.CONTENT_TYPE_OVERRIDE}} and then ask the AutoDetectReader or the DefaultEncodingDetector to read your bytes. This will not shortcut our copy of ICU4j's CharsetDetector because it relies on the OverrideDetector being called first within the DefaultDetector.
[ https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16672520#comment-16672520 ]

Hans Brende commented on TIKA-2771:
---
Just had another thought: when the input filter is enabled, it strips everything between "<" and ">" brackets (i.e. 0x3C and 0x3E), correct? But doing so *presupposes* an ASCII-compatible encoding! Thus, if a significant number of matching "<" and ">" symbols are found, you *already* know it can't be IBM500! ("<" and ">" in IBM500 are 0x4C and 0x6E, respectively.) I assume you could extend this logic to other ASCII-incompatible charsets as well.
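The pre-check suggested in that comment can be sketched in a few lines. This is a hypothetical illustration only (the class and method names are invented, not part of Tika or icu4j): if the raw bytes already contain matched ASCII "<" (0x3C) and ">" (0x3E) pairs, then the markup stripping done by the input filter has implicitly assumed an ASCII-compatible encoding, so EBCDIC charsets such as IBM500 could be excluded before any n-gram scoring.

```java
// Hypothetical pre-filter sketch: matched 0x3C/0x3E pairs in the raw bytes
// imply an ASCII-compatible encoding, ruling out EBCDIC charsets like IBM500
// (whose "<" and ">" are 0x4C and 0x6E instead).
public class AsciiMarkupCheck {

    /**
     * Returns true if the input contains at least {@code minPairs} matched
     * ASCII '<' (0x3C) ... '>' (0x3E) bracket pairs.
     */
    public static boolean hasAsciiTagPairs(byte[] input, int minPairs) {
        int pairs = 0;
        boolean open = false;
        for (byte b : input) {
            if (b == 0x3C) {            // ASCII '<'
                open = true;
            } else if (b == 0x3E && open) { // ASCII '>' closing a pair
                pairs++;
                open = false;
            }
        }
        return pairs >= minPairs;
    }
}
```

Under this sketch, a detector could skip scoring ASCII-incompatible charsets whenever the check fires, rather than letting the stripped (ASCII-interpreted) text feed their n-gram models.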
[ https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16672203#comment-16672203 ]

Hans Brende commented on TIKA-2771:
---
(Source: https://w3techs.com/technologies/overview/character_encoding/all )
[ https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16672196#comment-16672196 ]

Hans Brende commented on TIKA-2771:
---
Oh... and probably the best hint of all that this is not IBM500 is that approximately 0.0% of html markup worldwide is encoded in IBM500.
[ https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16672178#comment-16672178 ]

Hans Brende commented on TIKA-2771:
---
One good hint that this is not IBM500 is that *all* of the characters are printable US-ASCII characters, i.e., in the range 0x20 to 0x7E (whereas most of IBM500's printable characters are non-ASCII-printable). An even better hint is that *all* of the characters are in the range of ASCII corresponding to *letters* (plus the space and colon characters). Hope that helps.
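That byte-range heuristic is easy to make concrete. The following is a hypothetical sketch (invented names, not icu4j's API): input whose bytes all fall in the printable US-ASCII range 0x20..0x7E (plus common whitespace) is very unlikely to be IBM500, since EBCDIC letters map to bytes 0x81 and above.

```java
// Hypothetical heuristic sketch: all-printable-ASCII input is strong evidence
// against an EBCDIC interpretation, because in IBM500 ordinary letters occupy
// 0x81-0xA9 (lowercase) and 0xC1-0xE9 (uppercase), not the 0x20-0x7E range.
public class PrintableAsciiCheck {

    /** Returns true if every byte is printable US-ASCII or common whitespace. */
    public static boolean allPrintableAscii(byte[] input) {
        for (byte b : input) {
            int v = b & 0xFF;  // treat the byte as unsigned
            boolean whitespace = v == 0x09 || v == 0x0A || v == 0x0D;
            if (!whitespace && (v < 0x20 || v > 0x7E)) {
                return false;  // a byte outside the printable-ASCII range
            }
        }
        return true;
    }
}
```

A detector applying this check could cap (or zero out) the confidence of ASCII-incompatible charsets for such input instead of reporting IBM500 at 60.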
[ https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16672134#comment-16672134 ]

Hans Brende commented on TIKA-2771:
---
I mean, because otherwise, if you're doing n-gram detection for IBM500 and you don't adjust the confidence based on the length of the input, you should also be doing n-gram detection for UTF-8! Because if n-gram detection for IBM500 interpreted as "fr" gives a confidence of 60, then UTF-8 interpreted as "en" should give a confidence of at least 60 as well!
[ https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16672116#comment-16672116 ]

Hans Brende commented on TIKA-2771:
---
Not sure if this is a contributing factor, but peering into the source code reveals that the IBM500 detection is based on ngrams that use a space character of 0x20 — yet the space character in IBM500 is actually 0x40. Also, it appears that the confidence for IBM500 is obtained by multiplying the raw fractional percentage of ngram hits by 300%. Is that number arbitrary? Shouldn't the "confidence" decrease by a lot if the length of the input is very small, and therefore not very statistically significant?
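The scaling question can be made concrete with a small sketch. This is an illustration, not icu4j's actual implementation: {{rawConfidence}} mirrors the described "hit fraction × 300%" computation, and {{adjustedConfidence}} shows one possible way to damp the score for short inputs. The {{fullWeightLength}} threshold is an invented parameter, purely for demonstration.

```java
// Illustrative sketch of the confidence computation under discussion.
// rawConfidence follows the described hits/total * 300% formula (capped at 100);
// adjustedConfidence adds a hypothetical length damping so that a 60 earned on
// a handful of bytes counts for less than a 60 earned on a full page.
public class NgramConfidence {

    /** Raw confidence: fraction of n-gram hits scaled by 300%, capped at 100. */
    public static int rawConfidence(int hits, int ngramCount) {
        if (ngramCount == 0) {
            return 0;  // no n-grams observed: nothing to score
        }
        return Math.min(100, (int) (300.0 * hits / ngramCount));
    }

    /**
     * Hypothetical length-adjusted variant: scale the raw confidence by how
     * much input was actually seen, relative to an (invented) threshold at
     * which the sample is considered statistically significant.
     */
    public static int adjustedConfidence(int hits, int ngramCount,
                                         int inputLength, int fullWeightLength) {
        double lengthWeight = Math.min(1.0, (double) inputLength / fullWeightLength);
        return (int) (rawConfidence(hits, ngramCount) * lengthWeight);
    }
}
```

With a damping of this kind, the short filtered sample from the bug report would no longer be able to reach IBM500's winning score of 60.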