Hans Brende created TIKA-2771:
---------------------------------

             Summary: enableInputFilter() wrecks charset detection for some short html documents
                 Key: TIKA-2771
                 URL: https://issues.apache.org/jira/browse/TIKA-2771
             Project: Tika
          Issue Type: Bug
          Components: detector
    Affects Versions: 1.19.1
            Reporter: Hans Brende

When I run the CharsetDetector on http://w3c.github.io/microdata-rdf/tests/0065.html with the input filter enabled, the most confident result is, very strangely, "IBM500" with a confidence of 60, *even if I set the declared encoding to UTF-8*.

This can be replicated with the following code:

{code:java}
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import org.apache.tika.parser.txt.CharsetDetector;

CharsetDetector detect = new CharsetDetector();
detect.enableInputFilter(true);      // strip markup before detection
detect.setDeclaredEncoding("UTF-8"); // hint that should favor UTF-8
detect.setText(("<!DOCTYPE html>\n" +
        "<div>\n" +
        "  <div itemscope itemtype=\"http://schema.org/Person\" id=\"amanda\" itemref=\"a b\"></div>\n" +
        "  <p id=\"a\">Name: <span itemprop=\"name\">Amanda</span></p>\n" +
        "  <p id=\"b\" itemprop=\"band\">Jazz Band</p>\n" +
        "</div>").getBytes(StandardCharsets.UTF_8));
// Print every candidate charset, most confident first.
Arrays.stream(detect.detectAll()).forEach(System.out::println);
{code}

which prints:

{noformat}
Match of IBM500 in fr with confidence 60
Match of UTF-8 with confidence 57
Match of ISO-8859-9 in tr with confidence 50
Match of ISO-8859-1 in en with confidence 50
Match of ISO-8859-2 in cs with confidence 12
Match of Big5 in zh with confidence 10
Match of EUC-KR in ko with confidence 10
Match of EUC-JP in ja with confidence 10
Match of GB18030 in zh with confidence 10
Match of Shift_JIS in ja with confidence 10
Match of UTF-16LE with confidence 10
Match of UTF-16BE with confidence 10
{noformat}

Note that if I do not set the declared encoding to UTF-8, the result is even worse: the confidence for UTF-8 falls from 57 to 15.

This breaks 1 of the 84 online microdata extraction tests over in Any23 (that particular page is rendered as complete gibberish), so I had to implement some hacky workarounds, which I'd like to remove if possible.
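
For context on why short documents are affected: with the input filter enabled, text inside angle brackets is removed before detection, so a markup-heavy page leaves very little for the statistical recognizers to work with. The sketch below is a simplified stand-in written only for illustration, not Tika's actual filtering code; the class {{TagStripSketch}} and its {{stripTags}} helper are hypothetical names, and the real filter may apply additional heuristics. It shows how few bytes of the sample document survive such stripping.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration only; this is NOT Tika's implementation.
class TagStripSketch {

    // The sample document from this report.
    static final String SAMPLE =
            "<!DOCTYPE html>\n" +
            "<div>\n" +
            "  <div itemscope itemtype=\"http://schema.org/Person\" id=\"amanda\" itemref=\"a b\"></div>\n" +
            "  <p id=\"a\">Name: <span itemprop=\"name\">Amanda</span></p>\n" +
            "  <p id=\"b\" itemprop=\"band\">Jazz Band</p>\n" +
            "</div>";

    // Assumed simplification: drop every byte from '<' through the next '>'.
    static byte[] stripTags(byte[] raw) {
        byte[] out = new byte[raw.length];
        int n = 0;
        boolean inMarkup = false;
        for (byte b : raw) {
            if (inMarkup) {
                if (b == '>') {
                    inMarkup = false; // end of a tag; the '>' itself is dropped
                }
            } else if (b == '<') {
                inMarkup = true;      // start of a tag; the '<' itself is dropped
            } else {
                out[n++] = b;         // ordinary text byte, keep it
            }
        }
        return Arrays.copyOf(out, n);
    }

    public static void main(String[] args) {
        byte[] raw = SAMPLE.getBytes(StandardCharsets.UTF_8);
        byte[] stripped = stripTags(raw);
        String text = new String(stripped, StandardCharsets.UTF_8);
        System.out.println("raw bytes:      " + raw.length);
        System.out.println("stripped bytes: " + stripped.length);
        // Collapse whitespace so the surviving text is easy to read.
        System.out.println("remaining text: " + text.trim().replaceAll("\\s+", " "));
    }
}
```

Under these assumptions, essentially only "Name: Amanda" and "Jazz Band" plus whitespace remain, which may explain why the confidence scores are so unstable for this page.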