Hans Brende created TIKA-2771:
---------------------------------
Summary: enableInputFilter() wrecks charset detection for some short html documents
Key: TIKA-2771
URL: https://issues.apache.org/jira/browse/TIKA-2771
Project: Tika
Issue Type: Bug
Components: detector
Affects Versions: 1.19.1
Reporter: Hans Brende
When I run the CharsetDetector on
http://w3c.github.io/microdata-rdf/tests/0065.html with the input filter
enabled, the highest-confidence result is, very strangely, "IBM500" with a
confidence of 60, *even if I set the declared encoding to UTF-8*.
This can be reproduced with the following code:
{code:java}
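import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import org.apache.tika.parser.txt.CharsetDetector;

// Enable the input filter and declare UTF-8 before handing over the bytes.
CharsetDetector detect = new CharsetDetector();
detect.enableInputFilter(true);
detect.setDeclaredEncoding("UTF-8");
detect.setText(("<!DOCTYPE html>\n" +
"<div>\n" +
" <div itemscope itemtype=\"http://schema.org/Person\" id=\"amanda\" itemref=\"a b\"></div>\n" +
" <p id=\"a\">Name: <span itemprop=\"name\">Amanda</span></p>\n" +
" <p id=\"b\" itemprop=\"band\">Jazz Band</p>\n" +
"</div>").getBytes(StandardCharsets.UTF_8));
// Print every candidate match, highest confidence first.
Arrays.stream(detect.detectAll()).forEach(System.out::println);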
{code}
which prints:
{noformat}
Match of IBM500 in fr with confidence 60
Match of UTF-8 with confidence 57
Match of ISO-8859-9 in tr with confidence 50
Match of ISO-8859-1 in en with confidence 50
Match of ISO-8859-2 in cs with confidence 12
Match of Big5 in zh with confidence 10
Match of EUC-KR in ko with confidence 10
Match of EUC-JP in ja with confidence 10
Match of GB18030 in zh with confidence 10
Match of Shift_JIS in ja with confidence 10
Match of UTF-16LE with confidence 10
Match of UTF-16BE with confidence 10
{noformat}
Note that if I do not set the declared encoding to UTF-8, the result is even
worse, with UTF-8 falling from a confidence of 57 to 15.
This breaks 1 of my 84 online microdata extraction tests over in Any23 (that
particular page is being decoded into complete gibberish), so I had to
implement some hacky workarounds, which I'd like to remove if possible.
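For illustration only, one possible shape for such a workaround is sketched below. It is not necessarily what Any23 does, and it is not part of Tika: the class name DeclaredCharsetCheck and the method decodesStrictly are hypothetical. The idea is to check, using only the JDK, whether the bytes decode cleanly under the declared encoding, and to consult the detector only when they do not.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper (an assumption for illustration; not a Tika or Any23 API).
class DeclaredCharsetCheck {

    /**
     * Returns true if the bytes decode under the named charset without any
     * malformed or unmappable sequences; false for unknown or invalid names.
     */
    static boolean decodesStrictly(byte[] bytes, String charsetName) {
        if (charsetName == null) {
            return false;
        }
        try {
            if (!Charset.isSupported(charsetName)) {
                return false;
            }
            Charset.forName(charsetName)
                    .newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (IllegalArgumentException | CharacterCodingException e) {
            // Illegal charset name, or the bytes are not valid in this charset.
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] html = "<p id=\"a\">Name: <span itemprop=\"name\">Amanda</span></p>"
                .getBytes(StandardCharsets.UTF_8);
        // If this is true, a caller could trust the declared encoding and skip detection.
        System.out.println(decodesStrictly(html, "UTF-8"));
    }
}
```

This check is only informative for encodings that can reject input, such as UTF-8. Single-byte charsets like ISO-8859-1 accept every byte sequence, so a declared encoding of that kind would always pass.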