[jira] [Commented] (TIKA-2771) enableInputFilter() wrecks charset detection for some short html documents

2018-11-08 Thread Hans Brende (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16680557#comment-16680557
 ] 

Hans Brende commented on TIKA-2771:
---

[~talli...@apache.org] Great! I will definitely check that out. Detecting 
common words is a great idea!

One concern I'd have with oversampling non-UTF-8 pages is as follows: doing so 
may train encoding detectors to be more confident in non-UTF-8 encodings than 
they should be. An encoding detector that is optimized for a scenario in which 
the various encodings have "equal representation", so to speak, may actually be 
much less accurate on the real-life charset distribution (given that > 92% of 
the web is UTF-8).

But I'm curious to hear your rationale as well.

(I suppose also that this concern could possibly be mitigated by assigning 
greater weight to UTF-8 detection successes & failures than that of other 
charsets, in a manner proportional to their actual occurrences on the web.)
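To make that weighting idea concrete, here is a minimal sketch (the 0.92 share 
and the per-class accuracies are hypothetical illustrations, not measured 
values) of how a prior-weighted accuracy can flip the ranking between two 
detectors:

```java
public class WeightedAccuracy {
    // Expected real-world accuracy when utf8Share of pages are UTF-8 and the
    // detector is right with probability accOnUtf8 on UTF-8 pages and
    // accOnOther on everything else.
    static double overall(double utf8Share, double accOnUtf8, double accOnOther) {
        return utf8Share * accOnUtf8 + (1 - utf8Share) * accOnOther;
    }

    public static void main(String[] args) {
        // A detector tuned on balanced data: strong on rare charsets, weaker
        // on UTF-8. On a ~92% UTF-8 web it loses to a UTF-8-leaning detector.
        System.out.println(overall(0.92, 0.90, 0.95)); // balanced-tuned
        System.out.println(overall(0.92, 0.97, 0.60)); // UTF-8-leaning
    }
}
```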

> enableInputFilter() wrecks charset detection for some short html documents
> --
>
> Key: TIKA-2771
> URL: https://issues.apache.org/jira/browse/TIKA-2771
> Project: Tika
>  Issue Type: Bug
>  Components: detector
>Affects Versions: 1.19.1
>Reporter: Hans Brende
>Priority: Critical
>
> When I try to run the CharsetDetector on 
> http://w3c.github.io/microdata-rdf/tests/0065.html I get the very strange 
> most confident result of "IBM500" with a confidence of 60 when I enable the 
> input filter, *even if I set the declared encoding to UTF-8*.
> This can be replicated with the following code:
> {code:java}
> CharsetDetector detect = new CharsetDetector();
> detect.enableInputFilter(true);
> detect.setDeclaredEncoding("UTF-8");
> detect.setText(("\n" +
> "\n" +
> "  http://schema.org/Person\; id=\"amanda\" 
> itemref=\"a b\">\n" +
> "  Name: Amanda\n" +
> "  Jazz Band\n" +
> "").getBytes(StandardCharsets.UTF_8));
> Arrays.stream(detect.detectAll()).forEach(System.out::println);
> {code}
> which prints:
> {noformat}
> Match of IBM500 in fr with confidence 60
> Match of UTF-8 with confidence 57
> Match of ISO-8859-9 in tr with confidence 50
> Match of ISO-8859-1 in en with confidence 50
> Match of ISO-8859-2 in cs with confidence 12
> Match of Big5 in zh with confidence 10
> Match of EUC-KR in ko with confidence 10
> Match of EUC-JP in ja with confidence 10
> Match of GB18030 in zh with confidence 10
> Match of Shift_JIS in ja with confidence 10
> Match of UTF-16LE with confidence 10
> Match of UTF-16BE with confidence 10
> {noformat}
> Note that if I do not set the declared encoding to UTF-8, the result is even 
> worse, with UTF-8 falling from a confidence of 57 to 15. 
> This is screwing up 1 out of 84 of my online microdata extraction tests over 
> in Any23 (as that particular page is being rendered into complete gibberish), 
> so I had to implement some hacky workarounds which I'd like to remove if 
> possible.
> EDIT: This issue may be related to TIKA-2737 and [this 
> comment|https://issues.apache.org/jira/browse/TIKA-539?focusedCommentId=13213524=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13213524].



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TIKA-2771) enableInputFilter() wrecks charset detection for some short html documents

2018-11-08 Thread Nick Burch (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16680404#comment-16680404
 ] 

Nick Burch commented on TIKA-2771:
--

I'm not sure we do. We have documents along with the encoding that their web 
server claimed them to have, but I'm not sure I'd trust those server-reported 
encodings very far.

It's possible that you could find some useful files in the ICU4J test suite, 
though those would probably need wrapping in html tags, as I think they're 
mostly (all?) plain text.

Otherwise, it might be worth finding some "complicated" and "typical" files 
from a range of non-English languages, in known encodings, then using iconv (or 
similar) to convert them to other encodings to test with.
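That conversion step could also be done in Java rather than with iconv; a 
minimal sketch (the sample text and charset choice are just placeholders):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ReEncode {
    // Re-encode UTF-8 bytes into a target charset (roughly what
    // `iconv -f UTF-8 -t <target>` would do), or return null if the
    // target charset cannot represent the text.
    static byte[] reEncode(byte[] utf8, String target) {
        String text = new String(utf8, StandardCharsets.UTF_8);
        Charset cs = Charset.forName(target);
        return cs.newEncoder().canEncode(text) ? text.getBytes(cs) : null;
    }

    public static void main(String[] args) {
        byte[] latin1 = reEncode("caf\u00E9".getBytes(StandardCharsets.UTF_8), "ISO-8859-1");
        // "café" is 5 bytes in UTF-8 but 4 bytes in ISO-8859-1
        System.out.println(latin1.length);
    }
}
```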



[jira] [Commented] (TIKA-2771) enableInputFilter() wrecks charset detection for some short html documents

2018-11-08 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16680452#comment-16680452
 ] 

Tim Allison commented on TIKA-2771:
---

[~HansBrende], funny you mention that...as [~gagravarr] pointed out we now have 
header info, but that is, um, interesting.

I recently refreshed our large scale regression corpus: TIKA-2750.  See my wiki 
entry/blog post that describes how I did that: 
https://wiki.apache.org/tika/CommonCrawl3 .  You can download the sampling 
frames I used here: 
http://162.242.228.174/share/commoncrawl3/sampling_frames.zip


So, while it would be great if we had ground truth, I had planned to use 
tika-eval's out of vocabulary (OOV) metrics 
(https://wiki.apache.org/tika/TikaEval) to compare the text as extracted based 
on different detectors... the idea is that encoding detectors are not likely to 
"hallucinate" common words if they're wrong.
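As a sketch of that idea (the tiny word list and the scoring are my own 
stand-ins, not tika-eval's actual implementation): decode the same bytes with 
each candidate charset and score the fraction of tokens that are known common 
words; a wrong charset should score near zero.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class CommonWordScore {
    // Stand-in for a per-language common-word list.
    static final Set<String> COMMON = new HashSet<>(Arrays.asList(
            "the", "and", "name", "jazz", "band", "a", "of", "to"));

    // Fraction of letter-tokens that are known common words after decoding
    // the bytes with the candidate charset; a wrong charset is unlikely to
    // "hallucinate" real words, so its score should collapse.
    static double score(byte[] data, String charsetName) {
        String[] tokens = new String(data, Charset.forName(charsetName))
                .toLowerCase().split("[^\\p{L}]+");
        long hits = Arrays.stream(tokens).filter(COMMON::contains).count();
        return tokens.length == 0 ? 0 : (double) hits / tokens.length;
    }

    public static void main(String[] args) {
        byte[] utf8 = "Name: Amanda\nJazz Band".getBytes(StandardCharsets.UTF_8);
        System.out.println(score(utf8, "UTF-8"));   // high: real words survive
        System.out.println(score(utf8, "IBM500"));  // low: decodes to gibberish
    }
}
```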







[jira] [Commented] (TIKA-2771) enableInputFilter() wrecks charset detection for some short html documents

2018-11-08 Thread Hans Brende (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16680378#comment-16680378
 ] 

Hans Brende commented on TIKA-2771:
---

[~talli...@apache.org] Does Tika have a corpus of documents paired with 
expected encodings that I can use to test out the PR I mentioned? I'm very 
curious to find out how much of an improvement it is.



[jira] [Commented] (TIKA-2771) enableInputFilter() wrecks charset detection for some short html documents

2018-11-06 Thread Hans Brende (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16677340#comment-16677340
 ] 

Hans Brende commented on TIKA-2771:
---

[~talli...@apache.org] I've implemented my ideas for charset detection 
improvement in [this pull request|https://github.com/apache/any23/pull/131] 
over in Any23. If you have time to look it over, I'd appreciate any feedback 
you can give. Some of it might be useful for Tika as well (as I've taken into 
account a few open issues here).



[jira] [Commented] (TIKA-2771) enableInputFilter() wrecks charset detection for some short html documents

2018-11-06 Thread Hans Brende (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16676926#comment-16676926
 ] 

Hans Brende commented on TIKA-2771:
---

One thing I am sure of, however, is that if your chances of getting a false 
positive for a given charset are *greater* than your chances of actually finding 
that charset "in the wild", then it is counterproductive to try to detect it in 
the first place.

That goes not just for IBM500, but for anything that isn't UTF-8. Given that > 
90% of the web is UTF-8 (and the web, correct me if I'm wrong, seems to be the 
primary use case for charset detection), a charset detector whose strategy is 
simply: {code:java}return "UTF-8";{code} is going to be at least 90% accurate. 
Source: https://w3techs.com/technologies/overview/character_encoding/all

So detection of any charsets *other than* UTF-8 needs to increase the accuracy 
to something *greater than* 90%; otherwise the false positives will actually 
*decrease* the overall accuracy!
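To put hypothetical numbers on that claim: suppose a detector misidentifies 
UTF-8 pages as something else 10% of the time, but correctly identifies 80% of 
genuinely non-UTF-8 pages. On a 92% UTF-8 web it still loses to the constant 
guess:

```java
public class BaselineCheck {
    // Overall accuracy given the UTF-8 share of pages, the false-positive
    // rate on UTF-8 pages (calling them something else), and the recall
    // on genuinely non-UTF-8 pages.
    static double accuracy(double utf8Share, double fpOnUtf8, double recallOther) {
        return utf8Share * (1 - fpOnUtf8) + (1 - utf8Share) * recallOther;
    }

    public static void main(String[] args) {
        System.out.println(accuracy(0.92, 0.0, 0.0));   // always guess UTF-8: 0.92
        System.out.println(accuracy(0.92, 0.10, 0.80)); // "smarter" detector: ~0.892
    }
}
```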

(I bring this up because I noticed in a different issue thread (TIKA-2038) that 
it was mentioned that Tika [is only 72% 
accurate|https://issues.apache.org/jira/browse/TIKA-2038?focusedCommentId=15830525=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-15830525].
 Am I missing something here? Would we really detect charsets for webpages more 
accurately by simply guessing *everything* to be UTF-8?)



[jira] [Commented] (TIKA-2771) enableInputFilter() wrecks charset detection for some short html documents

2018-11-06 Thread Hans Brende (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16676887#comment-16676887
 ] 

Hans Brende commented on TIKA-2771:
---

Compare to the following analogous test for ISO-8859-1 variants:

||Issue||ISO-8859-X||byteMap'ed||2nd best||p||n||Wilson L.B.||IBM500 L.B.||
|TIKA-771|"Hello, World!"|"hello  world "|ISO-8859-1(it)|23%|10|*p' = 7%*|7%|
|TIKA-868|"Indanyl"|"indanyl"|ISO-8859-9(tr)|37%|7|*p' = 12%*|13%|
|TIKA-2771|"Name: Amanda\nJazz Band"|"name  amanda jazz band"|ISO-8859-1(en)|54%|18|*p' = 32%*|24%|

To calculate the Wilson lower bound, I used a confidence of 95% (i.e., z = 
1.96).

I'm not saying that the Wilson lower bound is *the* way to go (as you can see, 
it wasn't quite enough to fix TIKA-868, although it did reduce the discrepancy 
from 60% - 37% = *23%* to 13% - 12% = *1%*). So this method might need some 
adjustments. However, it does seem to represent a significant improvement over 
the way things are *now*.
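For reference, the lower bound I used is the standard Wilson score interval; a 
minimal version:

```java
public class Wilson {
    // Lower bound of the Wilson score interval for an observed proportion p
    // over n samples; z = 1.96 corresponds to 95% confidence.
    static double lowerBound(double p, int n, double z) {
        if (n == 0) return 0;
        double z2 = z * z;
        return (p + z2 / (2.0 * n)
                - z * Math.sqrt((p * (1 - p) + z2 / (4.0 * n)) / n))
                / (1 + z2 / n);
    }

    public static void main(String[] args) {
        // Reproduces the p' column, e.g. TIKA-771: p = 23%, n = 10 -> ~7%
        System.out.println(Math.round(100 * lowerBound(0.23, 10, 1.96))); // 7
        System.out.println(Math.round(100 * lowerBound(0.54, 18, 1.96))); // 32
    }
}
```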

*A simpler alternative would be to simply discard charsets which get less 
alphabetic text out of the input than a different charset does. This method 
would be successful for all 3 of the test cases I've presented here.* 



[jira] [Commented] (TIKA-2771) enableInputFilter() wrecks charset detection for some short html documents

2018-11-05 Thread Hans Brende (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16676109#comment-16676109
 ] 

Hans Brende commented on TIKA-2771:
---

[~talli...@apache.org] I did a little experimentation with each of the input 
texts that were causing trouble, for TIKA-771, TIKA-868, and this issue.

Here are the results:
||Issue||Input Text||IBM500||byteMap'ed||best||p||n||Wilson L.B.||
|TIKA-771|"Hello, World!"|"çÁ%%?  ï?Ê%À "|"çá ï ê à "|IBM500(fr)|30%|5|*p' = 7%*|
|TIKA-868|"Indanyl"|"ñ>À/>`%"|"ñ à"|IBM500(fr)|60%|2|*p' = 13%*|
|TIKA-2771|"Name: Amanda\nJazz Band"|"+/\_Á   \_/>À/ [/:: â/>À"|"   á  à    â  à"|IBM500(fr)|66%|4|*p' = 24%*|

One thing is evident to me from this test: it's not the mapping of control & 
punctuation chars to 0x20 that's the problem (the byteMap for ISO-8859-1 *also* 
strips control & punct chars by mapping them to whitespace)! Rather, the 
problem lies in the fact that under IBM500, much of the text is *likely* to be 
mapped to punctuation & control chars, *but the confidence is not reduced when 
the amount of actual alphabetic text being tested shrinks to near-zero.* 

The lower bound of the Wilson score confidence interval, however, seems to give 
a much better estimate of our actual confidence based on the number of 
characters we actually end up testing. (And while the initial "confidence" 
value is to some extent arbitrary, the importance of the Wilson lower bound is 
not the final number we get out, but that we are reducing confidences 
*relative* to the confidences of other charsets that succeeded in getting more 
alphabetic text out of the input, and *relative* to the confidences of any 
declared charsets.)



[jira] [Commented] (TIKA-2771) enableInputFilter() wrecks charset detection for some short html documents

2018-11-05 Thread Hans Brende (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16675880#comment-16675880
 ] 

Hans Brende commented on TIKA-2771:
---

[~wave] Yep, just ran the following
{code:java}
IntStream.range(0, 256).forEach(i -> System.out.println(
        Integer.toHexString(i) + ": "
                + new String(new byte[]{(byte) i}, Charset.forName("IBM500"))
                        .codePoints()
                        .mapToObj(Integer::toHexString)
                        .collect(Collectors.joining(" "))));
{code}

and it does look like IBM500 has a 1-to-1 mapping to Latin-1. That should 
simplify the fix.



[jira] [Commented] (TIKA-2771) enableInputFilter() wrecks charset detection for some short html documents

2018-11-05 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16675839#comment-16675839
 ] 

Tim Allison commented on TIKA-2771:
---

I was thinking something similar...



[jira] [Commented] (TIKA-2771) enableInputFilter() wrecks charset detection for some short html documents

2018-11-05 Thread Hans Brende (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16675828#comment-16675828
 ] 

Hans Brende commented on TIKA-2771:
---

[~talli...@apache.org] Ah, you're correct about the byteMap. The TODO 
comment threw me.

However, on closer inspection of the IBM500 byteMap, I see an even more 
alarming issue: 118 out of the 256 bytes map to 0x20!!! 

But only 0x40 should map to 0x20. 

This could explain why so many false positives for IBM500 are occurring: *all 
special characters are mapped to spaces, and then simply ignored*. But in order 
to have accurate n-gram measurements, those special characters need to be 
included in the calculations, I believe. I'm not sure, but perhaps they should 
be mapped to 0x00 instead of 0x20?
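The kind of audit described above can be sketched as follows. This is only an illustration: {{ByteMapAudit}} and the toy map are hypothetical, and the real 256-entry map lives in ICU4J's {{CharsetRecog_sbcs.java}}; the point is simply that counting how many input bytes collapse to 0x20 reveals how much of the byte space is excluded from the n-gram statistics.

```java
import java.util.Arrays;

public class ByteMapAudit {
    // Count how many of the 256 possible input bytes a single-byte-charset
    // byteMap collapses to the space character (0x20). Bytes mapped to space
    // are effectively ignored by the n-gram matcher.
    static int countMappedToSpace(byte[] byteMap) {
        int count = 0;
        for (byte b : byteMap) {
            if (b == 0x20) count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Toy map: everything collapses to space except one letter slot.
        byte[] toyMap = new byte[256];
        Arrays.fill(toyMap, (byte) 0x20);
        toyMap[0x81] = (byte) 'a'; // EBCDIC 'a'
        System.out.println(countMappedToSpace(toyMap)); // 255
    }
}
```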





[jira] [Commented] (TIKA-2771) enableInputFilter() wrecks charset detection for some short html documents

2018-11-05 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16675740#comment-16675740
 ] 

Tim Allison commented on TIKA-2771:
---

Got it.  Thank you.

bq. which calls: match(det, ngrams, byteMap, (byte) 0x20);

I agree, but doesn't the byteMap that is passed in map 0x40 to 0x20 during the 
actual match calculations?



[jira] [Commented] (TIKA-2771) enableInputFilter() wrecks charset detection for some short html documents

2018-11-05 Thread Hans Brende (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16675708#comment-16675708
 ] 

Hans Brende commented on TIKA-2771:
---

[~talli...@apache.org] I'm not sure which of the charsets are 
ASCII-incompatible; I just implemented the check as follows:

{code:java}
if (tagsWereStripped && !"< />".equals(new String("< />".getBytes(), charset))) {
    confidence = 0;
}
{code}

Re: space char:

{{public CharsetMatch match(CharsetDetector det)}} (CharsetRecog_sbcs.java: 
1279)
calls 
{{int match(CharsetDetector det, int[] ngrams, byte[] byteMap)}} 
(CharsetRecog_sbcs.java: 40)
which calls:
{{match(det, ngrams, byteMap, (byte) 0x20);}}



[jira] [Commented] (TIKA-2771) enableInputFilter() wrecks charset detection for some short html documents

2018-11-05 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16675527#comment-16675527
 ] 

Tim Allison commented on TIKA-2771:
---

I'm happy enough adding this check into EBCDIC500. Are there any other 
charsets for which we should do the same (e.g., where 0x3C and 0x3E do not map 
to '<' and '>')?

At some point, I started work on a stripper that took comments, script, style, 
etc. into consideration. I think we might want to look into that again. We may 
be able to use the {{PreScanner}} or some of the other code that arrived 
recently in the {{o.a.t.parsers.html.charsetdetector}} package.
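A quick way to answer the "which other charsets" question is to decode the two tag-marker bytes with each candidate charset and see whether they come back as angle brackets. This is a sketch; {{mapsAngleBrackets}} is a hypothetical helper, and IBM500 availability depends on the JDK's extended-charsets module:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class AngleBracketCheck {
    // True if the charset decodes bytes 0x3C and 0x3E to '<' and '>',
    // i.e., the ASCII-compatibility assumption the tag stripper relies on.
    static boolean mapsAngleBrackets(Charset cs) {
        return "<>".equals(new String(new byte[]{0x3C, 0x3E}, cs));
    }

    public static void main(String[] args) {
        System.out.println(mapsAngleBrackets(StandardCharsets.UTF_8));      // true
        System.out.println(mapsAngleBrackets(StandardCharsets.ISO_8859_1)); // true
        // EBCDIC code pages place '<' at 0x4C and '>' at 0x6E, so they fail:
        if (Charset.isSupported("IBM500")) {
            System.out.println(mapsAngleBrackets(Charset.forName("IBM500"))); // false
        }
    }
}
```

Running this over {{Charset.availableCharsets()}} would flag every detector charset that needs the tags-were-stripped guard.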



[jira] [Commented] (TIKA-2771) enableInputFilter() wrecks charset detection for some short html documents

2018-11-05 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16675520#comment-16675520
 ] 

Tim Allison commented on TIKA-2771:
---

When I add a {{tagsWereStripped}} flag and have the EBCDIC500 charsets return 
null when it is {{true}}, I get these results with declared encoding = "UTF-8" 
and the input filter enabled:
{noformat}
Match of UTF-8 with confidence 57
Match of ISO-8859-9 in tr with confidence 50
Match of ISO-8859-1 in en with confidence 50
Match of ISO-8859-2 in cs with confidence 12
Match of Big5 in zh with confidence 10
...
{noformat}

With declared encoding = "UTF-8" and the input filter disabled:
{noformat}
Match of UTF-8 with confidence 57
Match of ISO-8859-1 in en with confidence 31
Match of ISO-8859-9 in tr with confidence 19
Match of ISO-8859-2 in ro with confidence 15
...
{noformat}

With no declared encoding and the input filter enabled:
{noformat}
Match of ISO-8859-9 in tr with confidence 50
Match of ISO-8859-1 in en with confidence 50
Match of UTF-8 with confidence 15
Match of ISO-8859-2 in cs with confidence 12
...
{noformat}

With no declared encoding and the input filter disabled:
{noformat}
Match of ISO-8859-1 in en with confidence 31
Match of ISO-8859-9 in tr with confidence 19
Match of ISO-8859-2 in ro with confidence 15
Match of UTF-8 with confidence 15
...
{noformat}



[jira] [Commented] (TIKA-2771) enableInputFilter() wrecks charset detection for some short html documents

2018-11-05 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16675511#comment-16675511
 ] 

Tim Allison commented on TIKA-2771:
---

Let me try again.  I _think_ I've re-engaged my brain before I started typing 
this time.  Thank you for your patience.

bq. But since you've already modified it by supporting EBCDIC charsets...
+1

bq. (1) if any tags are stripped from the input (using 0x3C and 0x3E), that 
should automatically make the confidence for all EBCDIC charsets be zero. 

Yes, I agree with this, because the code currently fails to strip tags if there 
are too many {{badTags}}.

bq. (2) n-gram detection needs to happen using the proper space character (in 
this case, 0x40)
I agree with your point, but to confirm I understand our code, I _think_ we 
already do this mapping in EBCDIC's {{byteMap}} 
([here|https://github.com/apache/tika/blob/master/tika-parsers/src/main/java/org/apache/tika/parser/txt/CharsetRecog_sbcs.java#L1229]).
 We do map 0x40 (and a bunch of other bytes) to 0x20.

bq. (For my last thought, I'd recommend taking a look at the Wilson Score 
interval found here: 
https://en.wikipedia.org/wiki/Binomial_proportion_confidence_interval )

I agree that measuring confidence makes a great deal of sense, and perhaps 
Wilson is the right way to go.  However, I'd want to re-think how the stats 
were compiled, how the score is computed and whether there is an improvement as 
part of adding a confidence measurement.  The CharsetDetector, as it stands, 
has quite a bit of hackery in it, and I'd be concerned that adding a confidence 
interval on top of a somewhat, um, heuristic, score might give the wrong 
impression.  In short, I agree, but I'd want to do a bunch more work, 
including, potentially, redoing how the scores are calculated.

bq. "We no longer actively developing the charset detector function."
Yikes.  Thank you for pointing that out!




[jira] [Commented] (TIKA-2771) enableInputFilter() wrecks charset detection for some short html documents

2018-11-02 Thread Hans Brende (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16673481#comment-16673481
 ] 

Hans Brende commented on TIKA-2771:
---

(Also relating to my last thought, on the subject of "waiting for icu4j", I 
see: 

"We no longer actively developing the charset detector function."

from https://unicode-org.atlassian.net/browse/ICU-13465 )



[jira] [Commented] (TIKA-2771) enableInputFilter() wrecks charset detection for some short html documents

2018-11-02 Thread Hans Brende (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16673429#comment-16673429
 ] 

Hans Brende commented on TIKA-2771:
---

(For my last thought, I'd recommend taking a look at this: 
https://en.wikipedia.org/wiki/Binomial_proportion_confidence_interval )



[jira] [Commented] (TIKA-2771) enableInputFilter() wrecks charset detection for some short html documents

2018-11-02 Thread Hans Brende (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16673400#comment-16673400
 ] 

Hans Brende commented on TIKA-2771:
---

[~talli...@apache.org] I totally understand not wanting to modify ICU4J's code. 
But since you've *already* modified it by supporting EBCDIC charsets, that 
unfortunately is going to require additional modifications, since EBCDIC is not 
ASCII-compatible. E.g., in the {{MungeInput()}} method, an ASCII-compatible 
charset that maps 0x3C to "<" and 0x3E to ">" is *presupposed*. And then, the 
space character in IBM500 is not 0x20, but rather *0x40*. As a *bare minimum* 
set of modifications to the CharsetDetector class, I'd recommend the following: 

(1) if *any* tags are stripped from the input (using 0x3C and 0x3E), that 
should automatically make the confidence for all EBCDIC charsets be zero. 
(2) n-gram detection needs to happen using the proper space character (in this 
case, 0x40)

I'd also highly recommend lowering the confidence of n-gram detection for 
shorter text. If the "declared encoding" is compatible with the entire input 
text, but an n-gram detector assigns a confidence of 60 to a different encoding 
based on accidental n-gram detection due to the shortness of the text, the 
declared encoding should take precedence (esp. if the declared encoding is 
UTF-8 and the accidental encoding is, for all practical purposes, used almost 
nowhere). This last issue might, as you say, be an issue for icu4j... however, 
one advantage to copying their code over is the very fact that you don't *have* 
to wait on them to improve your own code. Just a thought.
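The "lower the confidence for shorter text" idea maps naturally onto the Wilson score interval mentioned earlier: score a charset by the interval's lower bound rather than the raw n-gram hit rate, so the same hit rate earns less confidence when there are few n-grams to count. A minimal sketch, assuming a hypothetical {{wilsonLower}} helper (how it would be wired into CharsetDetector's scoring is speculative):

```java
public class WilsonInterval {
    // Lower bound of the Wilson score interval for a proportion:
    // k successes out of n trials, z = 1.96 for ~95% confidence.
    // Fewer trials widen the interval and pull the lower bound down,
    // which is exactly the short-text discount described above.
    static double wilsonLower(int k, int n, double z) {
        if (n == 0) return 0.0;
        double p = (double) k / n;
        double z2 = z * z;
        double center = p + z2 / (2.0 * n);
        double margin = z * Math.sqrt((p * (1.0 - p) + z2 / (4.0 * n)) / n);
        return (center - margin) / (1.0 + z2 / n);
    }

    public static void main(String[] args) {
        // Same 80% n-gram hit rate, but far less certain with only 10 samples:
        System.out.println(wilsonLower(8, 10, 1.96));     // ~0.49
        System.out.println(wilsonLower(800, 1000, 1.96)); // ~0.77
    }
}
```

With this scoring, an accidental IBM500 n-gram match on a short document would carry a much lower confidence than the same hit rate on a long one, letting a compatible declared encoding win.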



[jira] [Commented] (TIKA-2771) enableInputFilter() wrecks charset detection for some short html documents

2018-11-02 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16673220#comment-16673220
 ] 

Tim Allison commented on TIKA-2771:
---

Let me re-engage my brain before typing again... sorry.



[jira] [Commented] (TIKA-2771) enableInputFilter() wrecks charset detection for some short html documents

2018-11-02 Thread Hans Brende (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16673215#comment-16673215
 ] 

Hans Brende commented on TIKA-2771:
---

[~talli...@apache.org] IBM500 (a.k.a. EBCDIC 500) is an EBCDIC charset. So, I'm 
confused: are you saying we should ask icu4j to support EBCDIC charsets, and 
then recopy their code?



[jira] [Commented] (TIKA-2771) enableInputFilter() wrecks charset detection for some short html documents

2018-11-02 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16673188#comment-16673188
 ] 

Tim Allison commented on TIKA-2771:
---

[~HansBrende], thank you for raising this issue and sharing this with us.  
Let's figure out how to fix this.

Charset detection on short strings is always problematic.

The CharsetDetector is a copy/paste from icu4j.  I _think_ the only difference 
is that we've added EBCDIC charsets that icu4j didn't want to support.  While I 
agree with you on the above, I'd much prefer to get the changes into ICU4j than 
to modify our fork and then try to maintain that delta when we next copy/paste 
from ICU4j.

If you still think there's a need to make modifications to our preprocessing, 
I'd be open to that, but the actual algorithmic changes should be made 
upstream, IMHO.

We do have a charset-override option which will allow you to say "treat this as 
(e.g.) UTF-8" no matter what detection says. Set whatever encoding you want in 
the Metadata object with this key: {{TikaCoreProperties.CONTENT_TYPE_OVERRIDE}} 
and then ask the AutoDetectReader or the DefaultEncodingDetector to read your 
bytes. This does not modify our copy of ICU4j's CharsetDetector; it relies on 
the OverrideDetector being called first within the DefaultDetector.
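
A minimal sketch of the override route described above, assuming tika-core 1.19.x on the classpath (the sample bytes are illustrative, not from the issue's test page):

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

import org.apache.tika.detect.AutoDetectReader;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.metadata.TikaCoreProperties;

public class CharsetOverrideSketch {
    public static void main(String[] args) throws Exception {
        byte[] html = "<p>Name: Amanda</p>".getBytes(StandardCharsets.UTF_8);

        Metadata metadata = new Metadata();
        // Force the charset regardless of what statistical detection would say.
        metadata.set(TikaCoreProperties.CONTENT_TYPE_OVERRIDE, "text/html; charset=UTF-8");

        try (AutoDetectReader reader = new AutoDetectReader(
                new ByteArrayInputStream(html), metadata)) {
            // Reports the overridden charset, not a detector guess.
            System.out.println(reader.getCharset());
        }
    }
}
```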

 





[jira] [Commented] (TIKA-2771) enableInputFilter() wrecks charset detection for some short html documents

2018-11-01 Thread Hans Brende (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16672520#comment-16672520
 ] 

Hans Brende commented on TIKA-2771:
---

Just had another thought: when the input filter is enabled, it strips 
everything within "<" and ">" brackets (i.e. 0x3C and 0x3E), correct? But doing 
so *presupposes* an ASCII-compatible encoding! Thus, if a significant number of 
matching "<" and ">" symbols are found, you *already* know it can't be IBM500! 
("<" and ">" in IBM500 are 0x4C and 0x6E, respectively.) I assume you could 
extend this logic to other ASCII-incompatible charsets as well.
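
The point about the brackets can be checked directly against the JDK's own IBM500 codec (available in standard JDK builds); a quick sketch:

```java
import java.nio.charset.Charset;

public class BracketCheck {
    public static void main(String[] args) {
        Charset ibm500 = Charset.forName("IBM500");
        byte[] brackets = "<>".getBytes(ibm500);
        // In IBM500, '<' and '>' encode as 0x4C and 0x6E, not the ASCII
        // 0x3C and 0x3E that the input filter matches on.
        System.out.printf("%02X %02X%n", brackets[0], brackets[1]); // 4C 6E
    }
}
```

So a stream containing many matched 0x3C/0x3E pairs is already strong evidence against any ASCII-incompatible EBCDIC charset.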



[jira] [Commented] (TIKA-2771) enableInputFilter() wrecks charset detection for some short html documents

2018-11-01 Thread Hans Brende (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16672203#comment-16672203
 ] 

Hans Brende commented on TIKA-2771:
---

(Source: https://w3techs.com/technologies/overview/character_encoding/all )



[jira] [Commented] (TIKA-2771) enableInputFilter() wrecks charset detection for some short html documents

2018-11-01 Thread Hans Brende (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16672196#comment-16672196
 ] 

Hans Brende commented on TIKA-2771:
---

Oh... and probably the best hint of all that this is not IBM500 is that 
approximately 0.0% of html markup worldwide is encoded in IBM500.



[jira] [Commented] (TIKA-2771) enableInputFilter() wrecks charset detection for some short html documents

2018-11-01 Thread Hans Brende (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16672178#comment-16672178
 ] 

Hans Brende commented on TIKA-2771:
---

One good hint that this is not IBM500 is that *all* of the characters are 
printable US-ASCII characters, i.e., in the range 0x20 to 0x7E (whereas most of 
IBM500's printable characters are non-ASCII-printable). An even better hint 
that this is not IBM500 is that *all* of the characters are in the range of 
ASCII corresponding to *letters* (plus the space and colon characters).

Hope that helps.
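
That heuristic is easy to state in code; a sketch (the method name and whitespace allowances are illustrative, not Tika API):

```java
import java.nio.charset.StandardCharsets;

public class AsciiPrintableCheck {
    /** True if every byte is printable US-ASCII (0x20-0x7E) or common whitespace. */
    static boolean allPrintableAscii(byte[] input) {
        for (byte b : input) {
            int c = b & 0xFF;
            if ((c < 0x20 || c > 0x7E) && c != '\n' && c != '\r' && c != '\t') {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // The sample document's visible text is all printable ASCII, which is
        // wildly improbable for EBCDIC: IBM500 puts its letters above 0x80.
        byte[] sample = "Name: Amanda\nJazz Band\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(allPrintableAscii(sample)); // true
    }
}
```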



[jira] [Commented] (TIKA-2771) enableInputFilter() wrecks charset detection for some short html documents

2018-11-01 Thread Hans Brende (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16672134#comment-16672134
 ] 

Hans Brende commented on TIKA-2771:
---

I mean, because otherwise, if you're doing n-gram detection for IBM500 and you 
don't include adjustments to the confidence based on the length of the input, 
you should also be doing n-gram detection for UTF-8! Because if n-gram 
detection for IBM500 interpreted as "fr" gives a confidence of 60, then UTF-8 
interpreted as "en" should give a confidence of at least 60 as well!



[jira] [Commented] (TIKA-2771) enableInputFilter() wrecks charset detection for some short html documents

2018-11-01 Thread Hans Brende (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16672116#comment-16672116
 ] 

Hans Brende commented on TIKA-2771:
---

Not sure if this is a contributing factor, but peering into the source code 
reveals that the IBM500 recognizer's ngram model assumes a space character of 
0x20. But the space character in IBM500 is actually 0x40.

Also, it appears that the confidence for IBM500 is obtained by multiplying the 
raw fraction of ngram hits by 300. Is that factor arbitrary? Shouldn't the 
confidence decrease substantially when the input is very short, and therefore 
not statistically significant?
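
The space-character mismatch is easy to verify against the JDK's IBM500 codec; a quick sketch:

```java
import java.nio.charset.Charset;

public class SpaceCheck {
    public static void main(String[] args) {
        Charset ibm500 = Charset.forName("IBM500");
        // IBM500 encodes the space as 0x40; 0x20 is not a space in EBCDIC,
        // so an ngram table keyed on 0x20 word breaks cannot line up with
        // real IBM500 text.
        byte[] space = " ".getBytes(ibm500);
        System.out.printf("%02X%n", space[0]); // 40
    }
}
```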
