Tika currently misuses the HTTP Content-Encoding header, and does not seem to
use the charset part of the Content-Type header properly.
---------------------------------------------------------------------------------------------------------------------------------------
Key: TIKA-431
URL: https://issues.apache.org/jira/browse/TIKA-431
Project: Tika
Issue Type: Bug
Components: general
Reporter: Erik Hetzner
Tika currently misuses the HTTP Content-Encoding header, and does not seem to
use the charset part of the Content-Type header properly.
Content-Encoding is not for the charset. It names the content coding applied
to the body, with values like gzip, deflate, compress, or identity.
The charset is passed as a parameter of the Content-Type header. For instance:
text/html; charset=iso-8859-1
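
To make the distinction concrete, here is a minimal sketch of pulling the
charset parameter out of a Content-Type value. The class and method names
(ContentTypeCharset, charsetOf) are hypothetical and are not part of Tika. The
parsing is deliberately simple and assumes no semicolons appear inside quoted
parameter values.

```java
import java.util.Optional;

// Hypothetical helper, not a Tika class: extracts the charset parameter
// from a Content-Type header value such as "text/html; charset=iso-8859-1".
public final class ContentTypeCharset {
    private ContentTypeCharset() {}

    public static Optional<String> charsetOf(String contentType) {
        if (contentType == null) {
            return Optional.empty();
        }
        // parts[0] is the media type; everything after it is a parameter.
        // Assumption: no ';' inside quoted parameter values.
        String[] parts = contentType.split(";");
        for (int i = 1; i < parts.length; i++) {
            String param = parts[i].trim();
            int eq = param.indexOf('=');
            if (eq < 0) {
                continue;
            }
            String name = param.substring(0, eq).trim();
            if (!name.equalsIgnoreCase("charset")) {
                continue;
            }
            String value = param.substring(eq + 1).trim();
            // Strip optional surrounding double quotes.
            if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                value = value.substring(1, value.length() - 1);
            }
            if (!value.isEmpty()) {
                return Optional.of(value);
            }
        }
        return Optional.empty();
    }
}
```

With this sketch, charsetOf("text/html; charset=iso-8859-1") yields
iso-8859-1, while a bare "text/html" yields an empty Optional.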
Tika should, in my opinion, do the following:
1. Stop using Content-Encoding, unless it wants me to be able to pass in
gzipped content in an input stream.
2. Parse and understand charset=... declarations if passed in the Metadata
object.
3. Return charset=... declarations in the Metadata object if a charset is
detected.
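
The three points above could be sketched roughly as follows. This is only an
illustration of the proposed behaviour, not Tika code: a plain Map stands in
for the Metadata object, and the names CharsetMetadataSketch, charsetHint, and
withCharset are hypothetical. It assumes a non-empty Content-Type value and
unquoted charset parameters written without spaces around the '='.

```java
import java.util.Map;

// Hypothetical sketch of the proposal; a Map stands in for Tika's Metadata.
public final class CharsetMetadataSketch {
    static final String CONTENT_TYPE = "Content-Type";

    private CharsetMetadataSketch() {}

    // Points 1 and 2: take the charset hint from Content-Type only.
    // Content-Encoding is never consulted for the charset.
    public static String charsetHint(Map<String, String> metadata) {
        String contentType = metadata.get(CONTENT_TYPE);
        if (contentType == null) {
            return null;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            // Case-insensitive match on the "charset=" prefix.
            if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                String value = p.substring(8).trim();
                return value.isEmpty() ? null : value;
            }
        }
        return null;
    }

    // Point 3: report a detected charset as a Content-Type parameter,
    // replacing any charset parameter that was already there.
    public static String withCharset(String contentType, String charset) {
        StringBuilder out = new StringBuilder();
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.isEmpty() || p.regionMatches(true, 0, "charset=", 0, 8)) {
                continue;
            }
            if (out.length() > 0) {
                out.append("; ");
            }
            out.append(p);
        }
        return out.append("; charset=").append(charset).toString();
    }
}
```

Under these assumptions, a map holding only Content-Encoding: UTF-8 produces
no charset hint, and after detection the caller would store
withCharset("text/html", "UTF-8") back under Content-Type.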