Tika currently misuses the HTTP Content-Encoding header, and does not seem to 
use the charset part of the Content-Type header properly.
---------------------------------------------------------------------------------------------------------------------------------------

                 Key: TIKA-431
                 URL: https://issues.apache.org/jira/browse/TIKA-431
             Project: Tika
          Issue Type: Bug
          Components: general
            Reporter: Erik Hetzner


Tika currently misuses the HTTP Content-Encoding header, and does not seem to 
use the charset part of the Content-Type header properly.

Content-Encoding does not carry the charset. It names the content coding 
applied to the body, with values like gzip, deflate, compress, or identity.

The charset is passed as a parameter of the Content-Type. For instance:
text/html; charset=iso-8859-1
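
To illustrate where the charset actually lives, here is a minimal sketch 
that pulls the charset parameter out of a Content-Type value. The class and 
method names (ContentTypeCharset, charsetOf) are hypothetical and not part of 
Tika. The parsing is deliberately simple and does not handle quoted values 
that contain semicolons.

```java
import java.util.Locale;
import java.util.Optional;

class ContentTypeCharset {

    /** Returns the charset parameter of a Content-Type value, if present. */
    static Optional<String> charsetOf(String contentType) {
        if (contentType == null) {
            return Optional.empty();
        }
        // parts[0] is the media type; the rest are name=value parameters.
        String[] parts = contentType.split(";");
        for (int i = 1; i < parts.length; i++) {
            String param = parts[i].trim();
            int eq = param.indexOf('=');
            if (eq < 0) {
                continue;
            }
            // Parameter names are case-insensitive.
            String name = param.substring(0, eq).trim().toLowerCase(Locale.ROOT);
            if (name.equals("charset")) {
                String value = param.substring(eq + 1).trim();
                // Strip optional surrounding double quotes.
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                return value.isEmpty() ? Optional.empty() : Optional.of(value);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // prints "iso-8859-1"
        System.out.println(charsetOf("text/html; charset=iso-8859-1").orElse("(none)"));
        // prints "(none)"
        System.out.println(charsetOf("text/html").orElse("(none)"));
    }
}
```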

Tika should, in my opinion, do the following:

1. Stop using Content-Encoding, unless it wants me to be able to pass in 
gzipped content in an input stream.

2. Parse and understand charset=... declarations if passed in the Metadata 
object.

3. Return charset=... declarations in the Metadata object if a charset is 
detected.
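
The separation of roles proposed in the list above could look roughly like 
the following sketch. It is not Tika code: HeaderRoles, decode and 
withCharset are hypothetical names, and the set of supported codings is an 
assumption. Content-Encoding is only used to undo a content coding on the 
stream, and a detected charset is reported as a Content-Type parameter.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;
import java.util.zip.InflaterInputStream;

class HeaderRoles {

    /** Undoes the content coding named by Content-Encoding (hypothetical helper). */
    static InputStream decode(InputStream in, String contentEncoding) throws IOException {
        if (contentEncoding == null) {
            return in;
        }
        switch (contentEncoding.trim().toLowerCase(Locale.ROOT)) {
            case "":
            case "identity":
                return in;
            case "gzip":
            case "x-gzip":
                return new GZIPInputStream(in);
            case "deflate":
                // Assumes the zlib-wrapped form of deflate.
                return new InflaterInputStream(in);
            default:
                throw new IOException("Unsupported Content-Encoding: " + contentEncoding);
        }
    }

    /** Builds a Content-Type value that carries the detected charset, if any. */
    static String withCharset(String mediaType, String charset) {
        return charset == null ? mediaType : mediaType + "; charset=" + charset;
    }

    public static void main(String[] args) throws IOException {
        // Gzip a small payload so there is something to decode.
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
            gz.write("hello".getBytes(StandardCharsets.UTF_8));
        }

        // Content-Encoding: gzip describes the stream, not the text inside it.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (InputStream in = decode(new ByteArrayInputStream(buf.toByteArray()), "gzip")) {
            int b;
            while ((b = in.read()) != -1) {
                out.write(b);
            }
        }
        System.out.println(new String(out.toByteArray(), StandardCharsets.UTF_8));

        // The charset travels with the Content-Type.
        System.out.println(withCharset("text/html", "UTF-8"));
    }
}
```

With this split, a caller that passes plain text never needs to set 
Content-Encoding at all, and a charset found by detection comes back in the 
same place a caller would have supplied it.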


