[jira] [Commented] (TIKA-2771) enableInputFilter() wrecks charset detection for some short html documents

Tim Allison (JIRA) Fri, 02 Nov 2018 07:53:29 -0700


    [ 
https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16673188#comment-16673188
 ]


Tim Allison commented on TIKA-2771:
-----------------------------------

[~HansBrende], thank you for raising this issue and sharing this with us.  
Let's figure out how to fix this.

Charset detection on short strings is always problematic.
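
To illustrate why, here is a minimal JDK-only sketch. The class and method names are made up for this example and are not part of Tika or ICU4J. It shows that pure-ASCII bytes decode to identical text under several ASCII-compatible charsets, so a short, mostly-ASCII document gives a statistical detector very little evidence to work with.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; not part of Tika or ICU4J.
class AmbiguousCharsets {

    // Returns every candidate charset under which the bytes decode to exactly
    // the same text as they do under UTF-8.
    static List<Charset> indistinguishableFromUtf8(byte[] bytes, List<Charset> candidates) {
        String asUtf8 = new String(bytes, StandardCharsets.UTF_8);
        List<Charset> same = new ArrayList<>();
        for (Charset cs : candidates) {
            if (new String(bytes, cs).equals(asUtf8)) {
                same.add(cs);
            }
        }
        return same;
    }

    public static void main(String[] args) {
        // A pure-ASCII fragment taken from the reported test document.
        byte[] html = "<p id=\"b\" itemprop=\"band\">Jazz Band</p>"
                .getBytes(StandardCharsets.UTF_8);
        List<Charset> candidates = Arrays.asList(
                StandardCharsets.US_ASCII,
                StandardCharsets.ISO_8859_1,
                StandardCharsets.UTF_16LE);
        // The ASCII-compatible candidates tie; UTF-16LE does not.
        System.out.println(indistinguishableFromUtf8(html, candidates));
    }
}
```

Once the input contains non-ASCII bytes the tie breaks, but a short document may contain few or none of them.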

The CharsetDetector is a copy/paste from icu4j.  I _think_ the only difference 
is that we've added EBCDIC charsets that icu4j didn't want to support.  While I 
agree with you on the above, I'd much prefer to get the changes into ICU4j than 
to modify our fork and then try to maintain that delta when we next copy/paste 
from ICU4j.

If you still think there's a need to make modifications to our preprocessing, 
I'd be open to that, but the actual algorithmic changes should be made 
upstream, IMHO.

We do have a charset-override option which will allow you to say "treat this as 
(e.g.) UTF-8" no matter what detection says. Set whatever encoding you want in 
the Metadata object with this key: {{TikaCoreProperties.CONTENT_TYPE_OVERRIDE}} 
and then ask the AutoDetectReader or the DefaultEncodingDetector to read your 
bytes.  This will shortcut our copy of ICU4j's CharsetDetector, but only because 
the OverrideDetector is called first within the DefaultDetector.
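
The ordering dependency can be sketched with a toy detector chain. This is only a model under stated assumptions: every class, interface, and key name below is hypothetical and stands in for the real Tika types, whose actual wiring may differ. The point is just that an override is honored only when the detector that reads it is consulted before the statistical one.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical stand-ins; these are NOT Tika's Metadata or detector classes.
class OverrideFirstDetection {

    // Hypothetical metadata key playing the role of the override property.
    static final String OVERRIDE_KEY = "charset-override";

    interface Detector {
        // Empty means "no opinion"; the chain then tries the next detector.
        Optional<Charset> detect(byte[] bytes, Map<String, String> metadata);
    }

    // Answers only when the caller has set an override in the metadata.
    static final Detector OVERRIDE = (bytes, md) ->
            Optional.ofNullable(md.get(OVERRIDE_KEY)).map(Charset::forName);

    // Placeholder for a statistical detector that always returns some guess.
    static final Detector STATISTICAL = (bytes, md) ->
            Optional.of(StandardCharsets.ISO_8859_1);

    // The first detector with an opinion wins, so order matters.
    static Charset detect(List<Detector> chain, byte[] bytes, Map<String, String> md) {
        for (Detector d : chain) {
            Optional<Charset> result = d.detect(bytes, md);
            if (result.isPresent()) {
                return result.get();
            }
        }
        return StandardCharsets.UTF_8; // arbitrary fallback for this sketch
    }

    public static void main(String[] args) {
        Map<String, String> md = new HashMap<>();
        md.put(OVERRIDE_KEY, "UTF-8");
        byte[] bytes = "<div>Jazz Band</div>".getBytes(StandardCharsets.UTF_8);
        // Override consulted first: the statistical guess is never asked for.
        System.out.println(detect(Arrays.asList(OVERRIDE, STATISTICAL), bytes, md));
        // Statistical detector consulted first: the override is ignored.
        System.out.println(detect(Arrays.asList(STATISTICAL, OVERRIDE), bytes, md));
    }
}
```

Calling the statistical detector directly, as in the reported snippet, corresponds to a chain that never includes the override step at all.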

 



> enableInputFilter() wrecks charset detection for some short html documents
> --------------------------------------------------------------------------
>
>                 Key: TIKA-2771
>                 URL: https://issues.apache.org/jira/browse/TIKA-2771
>             Project: Tika
>          Issue Type: Bug
>          Components: detector
>    Affects Versions: 1.19.1
>            Reporter: Hans Brende
>            Priority: Critical
>
> When I try to run the CharsetDetector on 
> http://w3c.github.io/microdata-rdf/tests/0065.html I get the very strange 
> most confident result of "IBM500" with a confidence of 60 when I enable the 
> input filter, *even if I set the declared encoding to UTF-8*.
> This can be replicated with the following code:
> {code:java}
> CharsetDetector detect = new CharsetDetector();
> detect.enableInputFilter(true);
> detect.setDeclaredEncoding("UTF-8");
> detect.setText(("<!DOCTYPE html>\n" +
>         "<div>\n" +
>         "  <div itemscope itemtype=\"http://schema.org/Person\" id=\"amanda\" " +
>         "itemref=\"a b\"></div>\n" +
>         "  <p id=\"a\">Name: <span itemprop=\"name\">Amanda</span></p>\n" +
>         "  <p id=\"b\" itemprop=\"band\">Jazz Band</p>\n" +
>         "</div>").getBytes(StandardCharsets.UTF_8));
> Arrays.stream(detect.detectAll()).forEach(System.out::println);
> {code}
> which prints:
> {noformat}
> Match of IBM500 in fr with confidence 60
> Match of UTF-8 with confidence 57
> Match of ISO-8859-9 in tr with confidence 50
> Match of ISO-8859-1 in en with confidence 50
> Match of ISO-8859-2 in cs with confidence 12
> Match of Big5 in zh with confidence 10
> Match of EUC-KR in ko with confidence 10
> Match of EUC-JP in ja with confidence 10
> Match of GB18030 in zh with confidence 10
> Match of Shift_JIS in ja with confidence 10
> Match of UTF-16LE with confidence 10
> Match of UTF-16BE with confidence 10
> {noformat}
> Note that if I do not set the declared encoding to UTF-8, the result is even 
> worse, with UTF-8 falling from a confidence of 57 to 15. 
> This is screwing up 1 out of 84 of my online microdata extraction tests over 
> in Any23 (as that particular page is being rendered into complete gibberish), 
> so I had to implement some hacky workarounds which I'd like to remove if 
> possible.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
