[ https://issues.apache.org/jira/browse/ANY23-411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hans Brende updated ANY23-411:
------------------------------
Description:
Incredibly enough, it seems that our encoding detector does not take the
Content-Type header into account at all when trying to guess a document's
charset encoding!
This has caused a problem for me with the page:
http://w3c.github.io/microdata-rdf/tests/0065.html
Even though the Content-Type header is set to "text/html; charset=utf-8", we're
guessing the charset to be: "IBM500", which in turn renders the page into
complete gibberish.
This must be a bug in Tika, because even when I set the declared encoding of
the charset detector to UTF-8, IBM500 is still the most confident result.
Cf. https://issues.apache.org/jira/browse/TIKA-2771
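For illustration only, here is a minimal sketch of the behaviour the issue asks for: read the charset parameter from the Content-Type header and consult a byte-level detector only when the header declares nothing usable. The class name ContentTypeCharset, both method names, and the Supplier-based detector hook are assumptions made for this example; they are not Any23 or Tika APIs, and the code uses only the Java standard library.

```java
import java.nio.charset.Charset;
import java.util.Locale;
import java.util.Optional;
import java.util.function.Supplier;

// Hypothetical helper, not part of Any23 or Tika.
public class ContentTypeCharset {

    /** Returns the charset named in a Content-Type value, if present and supported by this JVM. */
    public static Optional<Charset> declaredCharset(String contentType) {
        if (contentType == null) {
            return Optional.empty();
        }
        // Parameters follow the media type, separated by ';'.
        // Simplification: quoted parameter values containing ';' are not handled.
        String[] parts = contentType.split(";");
        for (int i = 1; i < parts.length; i++) {
            String param = parts[i].trim();
            if (param.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = param.substring("charset=".length()).trim();
                // Strip optional surrounding double quotes, e.g. charset="utf-8".
                if (name.length() >= 2 && name.startsWith("\"") && name.endsWith("\"")) {
                    name = name.substring(1, name.length() - 1);
                }
                try {
                    return Optional.of(Charset.forName(name));
                } catch (IllegalArgumentException e) {
                    // Malformed or unsupported charset name: treat as undeclared.
                    return Optional.empty();
                }
            }
        }
        return Optional.empty();
    }

    /** Prefers the declared charset; calls the detector only when nothing usable is declared. */
    public static Charset chooseCharset(String contentType, Supplier<Charset> detector) {
        return declaredCharset(contentType).orElseGet(detector);
    }
}
```

With the header from the page above, chooseCharset("text/html; charset=utf-8", detector) returns UTF-8 without invoking the detector at all. Trusting the header outright is only one possible policy; passing it to the detector as a hint is another, and this sketch does not settle which one the fix should adopt.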
was:
Incredibly enough, it seems that our encoding detector does not take the
Content-Type header into account at all when trying to guess a document's
charset encoding!
This has caused a problem for me with the page:
http://w3c.github.io/microdata-rdf/tests/0065.html
Even though the Content-Type header is set to "text/html; charset=utf-8", we're
guessing the charset to be: "IBM500", which in turn renders the page into
complete gibberish.
> Use Content-Type to help determine encoding
> -------------------------------------------
>
> Key: ANY23-411
> URL: https://issues.apache.org/jira/browse/ANY23-411
> Project: Apache Any23
> Issue Type: Bug
> Components: encoding
> Affects Versions: 2.3
> Reporter: Hans Brende
> Assignee: Hans Brende
> Priority: Major
> Fix For: 2.3