[
https://issues.apache.org/jira/browse/TIKA-1001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13740580#comment-13740580
]
Tim Allison commented on TIKA-1001:
---
Fixed as of r1514126. Thank you for submitting this issue with a test file!
tika no longer seems to honor HTTP meta tag for arabic text in ISO-8859-6
charset
-
Key: TIKA-1001
URL: https://issues.apache.org/jira/browse/TIKA-1001
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 1.2
Reporter: david lemon
Attachments: badarabic.html, TIKA-1001v1.tar.gz
The attached document extracts correctly in Tika 1.1.
The attached document extracts incorrectly in Tika 1.2.
The difference appears to be that Tika 1.1 honors the HTTP meta Content-Type
tag, which specifies the charset as ISO-8859-6, and correctly converts the
output to UTF-8.
Tika 1.2 appears to ignore the charset specified in the meta tag.
Some experimenting seems to indicate that the problem is the charset.
It doesn't matter what mode Tika is used in (server, app mode, etc.); even if
the Content-Type is specified with a charset, the output is still garbage.
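
To illustrate the symptom described above, here is a minimal sketch in Python
(not Tika's code, and not the attached badarabic.html). It assumes a small
made-up HTML page encoded in ISO-8859-6; the names HTML_BYTES,
declared_charset and decode_html are hypothetical. It shows that decoding with
the charset named in the meta tag recovers the Arabic text, while ignoring the
tag and falling back to ISO-8859-1 yields garbage.

```python
import re

# Hypothetical sample page: the bytes are encoded in ISO-8859-6 and the
# meta tag says so. This stands in for the attached test file.
ARABIC_TEXT = "مرحبا"
HTML_BYTES = (
    '<html><head><meta http-equiv="Content-Type" '
    'content="text/html; charset=iso-8859-6"></head>'
    "<body>" + ARABIC_TEXT + "</body></html>"
).encode("iso8859_6")

# Very small charset sniffer; real detectors handle many more cases.
META_CHARSET = re.compile(
    rb"""charset\s*=\s*["']?([A-Za-z0-9_.:-]+)""", re.IGNORECASE
)


def declared_charset(raw, default="iso-8859-1"):
    """Return the charset named in the first 8 KB of the page, or a default."""
    match = META_CHARSET.search(raw[:8192])
    return match.group(1).decode("ascii") if match else default


def decode_html(raw):
    """Decode the page using the charset declared in its meta tag."""
    return raw.decode(declared_charset(raw))


# Honoring the meta tag recovers the original Arabic text.
honored = decode_html(HTML_BYTES)

# Ignoring the meta tag (assumed fallback: ISO-8859-1) produces mojibake.
ignored = HTML_BYTES.decode("iso-8859-1")

if __name__ == "__main__":
    # Both strings can now be written out as UTF-8.
    print(honored.encode("utf-8").decode("utf-8"))
    print(ignored.encode("utf-8").decode("utf-8"))
```

The fallback charset here is only an assumption chosen to make the failure
visible; the report does not say which charset Tika 1.2 actually applied.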