[ https://issues.apache.org/jira/browse/ANY23-411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16664373#comment-16664373 ]
Hudson commented on ANY23-411:
------------------------------
SUCCESS: Integrated in Jenkins build Any23-trunk #1634 (See [https://builds.apache.org/job/Any23-trunk/1634/])
ANY23-411 fix encoding detector (hans: rev 0aa3d54c41aa90d6dce5aa790f6f490c82e7c7f3)
* (edit) api/src/main/java/org/apache/any23/encoding/EncodingDetector.java
* (edit) encoding/src/main/java/org/apache/any23/encoding/TikaEncodingDetector.java
* (edit) core/src/main/java/org/apache/any23/extractor/SingleDocumentExtraction.java
> Use Content-Type to help determine encoding
> -------------------------------------------
>
> Key: ANY23-411
> URL: https://issues.apache.org/jira/browse/ANY23-411
> Project: Apache Any23
> Issue Type: Bug
> Components: encoding
> Affects Versions: 2.3
> Reporter: Hans Brende
> Assignee: Hans Brende
> Priority: Major
> Fix For: 2.3
>
>
> Incredibly enough, it seems that our encoding detector does not take the
> Content-Type header into account at all when trying to guess a document's
> charset encoding!
> This has caused a problem for me with the page:
> http://w3c.github.io/microdata-rdf/tests/0065.html
> Even though the Content-Type header is set to "text/html; charset=utf-8",
> we guess the charset to be "IBM500", which in turn renders the page as
> complete gibberish.
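
For illustration only, here is a minimal Java sketch of the idea in the issue
title: read the charset parameter from a Content-Type value and prefer it over
a statistical guess. This is not the code from the commit above. The class
ContentTypeCharset and the methods charsetFromContentType and resolve are
hypothetical names, and the sketch assumes a declared, supported charset should
simply win over the detector's result, which may differ from how Any23 actually
weighs the hint.

```java
import java.nio.charset.Charset;
import java.util.Locale;
import java.util.Optional;

// Hypothetical helper, not part of Any23.
final class ContentTypeCharset {

    // Returns the charset named in a Content-Type value such as
    // "text/html; charset=utf-8", or empty if none is declared or the
    // declared name is not supported by this JVM.
    static Optional<Charset> charsetFromContentType(String contentType) {
        if (contentType == null) {
            return Optional.empty();
        }
        String[] parts = contentType.split(";");
        // parts[0] is the media type; parameters start at index 1.
        for (int i = 1; i < parts.length; i++) {
            String param = parts[i].trim();
            int eq = param.indexOf('=');
            if (eq < 0) {
                continue;
            }
            String name = param.substring(0, eq).trim().toLowerCase(Locale.ROOT);
            if (!name.equals("charset")) {
                continue;
            }
            String value = param.substring(eq + 1).trim();
            // Parameter values may be quoted: charset="utf-8".
            if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                value = value.substring(1, value.length() - 1);
            }
            try {
                return Optional.of(Charset.forName(value));
            } catch (IllegalArgumentException e) {
                // Covers illegal and unsupported charset names.
                return Optional.empty();
            }
        }
        return Optional.empty();
    }

    // Assumed policy: use the declared charset when there is one,
    // otherwise fall back to whatever the detector guessed.
    static Charset resolve(String contentType, Charset detected) {
        return charsetFromContentType(contentType).orElse(detected);
    }
}
```

Under this assumed policy, a response with "text/html; charset=utf-8" would
resolve to UTF-8 even if the detector had guessed IBM500 from the bytes alone.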
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)