[jira] [Commented] (ANY23-411) Use Content-Type to help determine encoding

Hudson (JIRA) Thu, 25 Oct 2018 15:42:21 -0700


    [ https://issues.apache.org/jira/browse/ANY23-411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16664373#comment-16664373 ]


Hudson commented on ANY23-411:
------------------------------

SUCCESS: Integrated in Jenkins build Any23-trunk #1634 (See [https://builds.apache.org/job/Any23-trunk/1634/])
ANY23-411 fix encoding detector (hans: rev 0aa3d54c41aa90d6dce5aa790f6f490c82e7c7f3)
* (edit) api/src/main/java/org/apache/any23/encoding/EncodingDetector.java
* (edit) encoding/src/main/java/org/apache/any23/encoding/TikaEncodingDetector.java
* (edit) core/src/main/java/org/apache/any23/extractor/SingleDocumentExtraction.java


> Use Content-Type to help determine encoding
> -------------------------------------------
>
>                 Key: ANY23-411
>                 URL: https://issues.apache.org/jira/browse/ANY23-411
>             Project: Apache Any23
>          Issue Type: Bug
>          Components: encoding
>    Affects Versions: 2.3
>            Reporter: Hans Brende
>            Assignee: Hans Brende
>            Priority: Major
>             Fix For: 2.3
>
>
> Incredibly enough, it seems that our encoding detector does not take the 
> Content-Type header into account at all when trying to guess a document's 
> charset encoding!
> This has caused a problem for me with the page: 
> http://w3c.github.io/microdata-rdf/tests/0065.html
> Even though the Content-Type header is set to "text/html; charset=utf-8", 
> we're guessing the charset to be: "IBM500", which in turn renders the page 
> into complete gibberish. 
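
For illustration only, here is a minimal sketch of how a charset could be read from a Content-Type value such as the one quoted above. It is not the code from the commit. The class name ContentTypeCharset and the method fromContentType are hypothetical and do not exist in Any23. It uses only the JDK.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.UnsupportedCharsetException;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Any23: extracts the charset parameter
// from a Content-Type header value, if one is declared and supported.
final class ContentTypeCharset {

    // Matches "; charset=<name>", optionally quoted, case-insensitively.
    private static final Pattern CHARSET_PARAM = Pattern.compile(
            ";\\s*charset\\s*=\\s*\"?([^\\s;\"]+)\"?",
            Pattern.CASE_INSENSITIVE);

    private ContentTypeCharset() {
    }

    static Optional<Charset> fromContentType(String contentType) {
        if (contentType == null) {
            return Optional.empty();
        }
        Matcher m = CHARSET_PARAM.matcher(contentType);
        if (!m.find()) {
            // No charset parameter declared.
            return Optional.empty();
        }
        try {
            return Optional.of(Charset.forName(m.group(1)));
        } catch (IllegalCharsetNameException | UnsupportedCharsetException e) {
            // Declared charset is malformed or unknown to this JVM.
            return Optional.empty();
        }
    }
}
```

Under this sketch, a caller would try fromContentType("text/html; charset=utf-8") first and fall back to byte-level detection only when the result is empty. Whether the declared charset should override the detector or merely serve as a hint to it is a design choice this example does not settle.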



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
