[
https://issues.apache.org/jira/browse/TIKA-431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13107214#comment-13107214
]
Ken Krugler commented on TIKA-431:
----------------------------------
Hi Robert,
I'm assuming you're talking about the case where all we have is the server
response header (versus the case where it's in the HTML meta tag), right?
If so, then I agree with you - I think it would be better to not trust that.
Given what I've seen coming back from web servers, they lie too often :) Though
the ICU detection code isn't very good either, as I found out after doing an
analysis.
Anyway, if I made that change, then the current code would go ahead and pass
the charset from the response header as the hint to ICU.
> Tika currently misuses the HTTP Content-Encoding header, and does not seem to
> use the charset part of the Content-Type header properly.
> ---------------------------------------------------------------------------------------------------------------------------------------
>
> Key: TIKA-431
> URL: https://issues.apache.org/jira/browse/TIKA-431
> Project: Tika
> Issue Type: Bug
> Components: general
> Reporter: Erik Hetzner
> Assignee: Ken Krugler
> Attachments: TIKA-431.patch
>
>
> Tika currently misuses the HTTP Content-Encoding header, and does not seem to
> use the charset part of the Content-Type header properly.
> Content-Encoding is not for the charset. It is for values like gzip, deflate,
> compress, or identity.
> Charset is passed in with the Content-Type. For instance: text/html;
> charset=iso-8859-1
> Tika should, in my opinion, do the following:
> 1. Stop using Content-Encoding, unless it wants me to be able to pass in
> gzipped content in an input stream.
> 2. Parse and understand charset=... declarations if passed in the Metadata
> object
> 3. Return charset=... declarations in the Metadata object if a charset is
> detected.
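
Points 2 and 3 above could be sketched roughly as follows. This is only an
illustration in plain Java, not Tika's actual code: the class
ContentTypeCharset and its methods parseCharset and withCharset are
hypothetical names, and the parsing is deliberately simplified (it splits on
semicolons and does not handle every quoting rule the HTTP grammar allows).

```java
import java.nio.charset.Charset;
import java.util.Locale;
import java.util.Optional;

// Hypothetical helper, not part of Tika.
final class ContentTypeCharset {

    private ContentTypeCharset() {
    }

    /**
     * Extracts the charset parameter from a Content-Type value such as
     * "text/html; charset=iso-8859-1". Returns empty if there is no
     * charset parameter or the JVM does not recognize the name.
     */
    static Optional<Charset> parseCharset(String contentType) {
        if (contentType == null) {
            return Optional.empty();
        }
        // Element 0 is the media type; parameters follow, separated by ';'.
        String[] parts = contentType.split(";");
        for (int i = 1; i < parts.length; i++) {
            String param = parts[i].trim();
            int eq = param.indexOf('=');
            if (eq < 0) {
                continue;
            }
            String name = param.substring(0, eq).trim().toLowerCase(Locale.ROOT);
            if (!name.equals("charset")) {
                continue;
            }
            String value = param.substring(eq + 1).trim();
            // Strip optional surrounding double quotes.
            if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                value = value.substring(1, value.length() - 1);
            }
            try {
                return Optional.of(Charset.forName(value));
            } catch (IllegalArgumentException e) {
                // Covers both illegal and unsupported charset names.
                return Optional.empty();
            }
        }
        return Optional.empty();
    }

    /** Builds a Content-Type value that reports a detected charset. */
    static String withCharset(String mediaType, Charset charset) {
        return mediaType + "; charset=" + charset.name();
    }
}
```

With this sketch, a Content-Encoding value such as "gzip" is never consulted
for the charset; only the Content-Type value is parsed, and a detected charset
is reported back in the same "type; charset=..." form.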