[ https://issues.apache.org/jira/browse/ANY23-411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hans Brende updated ANY23-411:
------------------------------
    Description: 
Incredibly enough, it seems that our encoding detector does not take the 
Content-Type header into account at all when trying to guess a document's 
charset encoding!

This has caused a problem for me with the page: 
http://w3c.github.io/microdata-rdf/tests/0065.html

Even though the Content-Type header is set to "text/html; charset=utf-8", we're 
guessing the charset to be "IBM500", which in turn renders the page as 
complete gibberish.

This must be a bug in Tika, because even when I set the declared encoding of 
the charset detector to UTF-8, IBM500 is still the most confident result.

Cf. https://issues.apache.org/jira/browse/TIKA-2771

  was:
Incredibly enough, it seems that our encoding detector does not take the 
Content-Type header into account at all when trying to guess a document's 
charset encoding!

This has caused a problem for me with the page: 
http://w3c.github.io/microdata-rdf/tests/0065.html

Even though the Content-Type header is set to "text/html; charset=utf-8", we're 
guessing the charset to be: "IBM500", which in turn renders the page into 
complete gibberish. 


> Use Content-Type to help determine encoding
> -------------------------------------------
>
>                 Key: ANY23-411
>                 URL: https://issues.apache.org/jira/browse/ANY23-411
>             Project: Apache Any23
>          Issue Type: Bug
>          Components: encoding
>    Affects Versions: 2.3
>            Reporter: Hans Brende
>            Assignee: Hans Brende
>            Priority: Major
>             Fix For: 2.3
>
>
> Incredibly enough, it seems that our encoding detector does not take the 
> Content-Type header into account at all when trying to guess a document's 
> charset encoding!
>
> This has caused a problem for me with the page: 
> http://w3c.github.io/microdata-rdf/tests/0065.html
>
> Even though the Content-Type header is set to "text/html; charset=utf-8", 
> we're guessing the charset to be "IBM500", which in turn renders the page 
> as complete gibberish.
>
> This must be a bug in Tika, because even when I set the declared encoding of 
> the charset detector to UTF-8, IBM500 is still the most confident result.
>
> Cf. https://issues.apache.org/jira/browse/TIKA-2771
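
One possible way to let the Content-Type header take part in the decision is sketched below. This is only an illustration under stated assumptions, not the fix actually applied in Any23 or Tika: the class ContentTypeCharset and its methods fromHeader, decodesCleanly and choose are hypothetical names, and the sketch uses only the JDK. The idea is to trust the declared charset whenever the document bytes decode under it without errors, and to fall back to the statistical detector's guess otherwise.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.util.Locale;
import java.util.Optional;

/** Hypothetical helper for illustration; not part of Any23 or Tika. */
public final class ContentTypeCharset {

    private ContentTypeCharset() {}

    /** Returns the charset parameter of a Content-Type value, if present and supported by this JVM. */
    public static Optional<Charset> fromHeader(String contentType) {
        if (contentType == null) {
            return Optional.empty();
        }
        String[] parts = contentType.split(";");
        // parts[0] is the media type (e.g. "text/html"); parameters follow.
        for (int i = 1; i < parts.length; i++) {
            String param = parts[i].trim();
            int eq = param.indexOf('=');
            if (eq < 0) {
                continue;
            }
            String name = param.substring(0, eq).trim().toLowerCase(Locale.ROOT);
            if (!name.equals("charset")) {
                continue;
            }
            String value = param.substring(eq + 1).trim();
            // The parameter value may be quoted: charset="utf-8"
            if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                value = value.substring(1, value.length() - 1);
            }
            try {
                return Optional.of(Charset.forName(value));
            } catch (IllegalArgumentException e) {
                // Illegal or unsupported charset name: treat as "not declared".
                return Optional.empty();
            }
        }
        return Optional.empty();
    }

    /** True if the bytes decode under the charset with no malformed or unmappable input. */
    public static boolean decodesCleanly(byte[] bytes, Charset charset) {
        try {
            charset.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    /**
     * Prefers the charset declared in the header when the bytes are valid in it;
     * otherwise returns the guess supplied by a statistical detector.
     */
    public static Charset choose(String contentType, byte[] bytes, Charset detectorGuess) {
        return fromHeader(contentType)
                .filter(cs -> decodesCleanly(bytes, cs))
                .orElse(detectorGuess);
    }
}
```

With a header of "text/html; charset=utf-8" and a body that is valid UTF-8, choose returns UTF-8 regardless of what the detector reports. The strict-decoding check is a design choice in this sketch: it keeps a wrong header from overriding the detector when the bytes are plainly not in the declared encoding.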



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)