[ https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16672520#comment-16672520 ]
Hans Brende commented on TIKA-2771:
-----------------------------------

Just had another thought: when the input filter is enabled, it strips everything within "<" and ">" brackets (i.e. 0x3C and 0x3E), correct? But doing so *presupposes* an ASCII-compatible encoding! Thus, if a significant number of matching "<" and ">" symbols are found, you *already* know the encoding can't be IBM500! ("<" and ">" in IBM500 are 0x4C and 0x6E, respectively.) I assume this logic could be extended to other ASCII-incompatible charsets as well.

> enableInputFilter() wrecks charset detection for some short html documents
> --------------------------------------------------------------------------
>
>                 Key: TIKA-2771
>                 URL: https://issues.apache.org/jira/browse/TIKA-2771
>             Project: Tika
>          Issue Type: Bug
>          Components: detector
>    Affects Versions: 1.19.1
>            Reporter: Hans Brende
>            Priority: Critical
>
> When I try to run the CharsetDetector on
> http://w3c.github.io/microdata-rdf/tests/0065.html I get the very strange
> most confident result of "IBM500" with a confidence of 60 when I enable the
> input filter, *even if I set the declared encoding to UTF-8*.
> This can be replicated with the following code:
> {code:java}
> CharsetDetector detect = new CharsetDetector();
> detect.enableInputFilter(true);
> detect.setDeclaredEncoding("UTF-8");
> detect.setText(("<!DOCTYPE html>\n" +
>     "<div>\n" +
>     " <div itemscope itemtype=\"http://schema.org/Person\" id=\"amanda\" itemref=\"a b\"></div>\n" +
>     " <p id=\"a\">Name: <span itemprop=\"name\">Amanda</span></p>\n" +
>     " <p id=\"b\" itemprop=\"band\">Jazz Band</p>\n" +
>     "</div>").getBytes(StandardCharsets.UTF_8));
> Arrays.stream(detect.detectAll()).forEach(System.out::println);
> {code}
> which prints:
> {noformat}
> Match of IBM500 in fr with confidence 60
> Match of UTF-8 with confidence 57
> Match of ISO-8859-9 in tr with confidence 50
> Match of ISO-8859-1 in en with confidence 50
> Match of ISO-8859-2 in cs with confidence 12
> Match of Big5 in zh with confidence 10
> Match of EUC-KR in ko with confidence 10
> Match of EUC-JP in ja with confidence 10
> Match of GB18030 in zh with confidence 10
> Match of Shift_JIS in ja with confidence 10
> Match of UTF-16LE with confidence 10
> Match of UTF-16BE with confidence 10
> {noformat}
> Note that if I do not set the declared encoding to UTF-8, the result is even
> worse, with UTF-8 falling from a confidence of 57 to 15.
> This is screwing up 1 out of 84 of my online microdata extraction tests over
> in Any23 (as that particular page is being rendered into complete gibberish),
> so I had to implement some hacky workarounds which I'd like to remove if
> possible.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
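The ASCII-incompatibility heuristic proposed in the comment above could be sketched roughly as follows. This is a hypothetical illustration, not ICU4J's or Tika's actual detection code; the class name, threshold, and method are invented for the example. The idea is simply to count matched 0x3C/0x3E pairs: if enough are found, the input is almost certainly in an ASCII-compatible encoding, so EBCDIC-family candidates such as IBM500 (where "<" is 0x4C and ">" is 0x6E) can be ruled out before scoring.

{code:java}
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of the heuristic from the comment above,
// NOT the actual ICU4J/Tika implementation.
public class AsciiCompatHeuristic {

    // Minimum number of matched <...> pairs before trusting the signal
    // (invented threshold for illustration).
    private static final int MIN_PAIRS = 3;

    public static boolean looksAsciiCompatible(byte[] input) {
        int open = 0;   // currently unmatched 0x3C ('<') bytes
        int pairs = 0;  // completed <...> pairs
        for (byte b : input) {
            if (b == 0x3C) {
                open++;
            } else if (b == 0x3E && open > 0) {
                open--;
                pairs++;
            }
        }
        return pairs >= MIN_PAIRS;
    }

    public static void main(String[] args) {
        byte[] html = "<!DOCTYPE html>\n<div><p>Name: <span>Amanda</span></p></div>"
                .getBytes(StandardCharsets.UTF_8);
        // Many matched <...> pairs, so EBCDIC candidates like IBM500
        // could be excluded from the candidate list.
        System.out.println(looksAsciiCompatible(html)); // prints "true"
    }
}
{code}

A detector could run such a check before scoring and drop ASCII-incompatible charsets from the candidate list whenever it returns true, which would have prevented the IBM500 result above.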