[ https://issues.apache.org/jira/browse/TIKA-1001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13742266#comment-13742266 ]

Tim Allison commented on TIKA-1001:
-----------------------------------

David,
  Thank you for submitting this. I fixed the issue triggered by your file and a few other variants that occurred to me. I wouldn't be surprised if we'll need to make more modifications. Please submit any other issues you find. Thank you, again.

> tika no longer seems to honor HTTP meta tag for arabic text in ISO-8859-6 charset
> ---------------------------------------------------------------------------------
>
>                 Key: TIKA-1001
>                 URL: https://issues.apache.org/jira/browse/TIKA-1001
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.2
>            Reporter: david lemon
>         Attachments: badarabic.html, TIKA-1001v1.tar.gz
>
>
> attached document extracts correctly in Tika 1.1
> attached document extracts incorrectly in tika 1.2.
> The difference appears to be that tika 1.1 honors the http meta content-type tag which specifies the charset as iso-8859-6, and correctly converts the output to UTF-8.
> tika 1.2 appears to ignore the charset specified in the meta tag.
> Some noodling seems to indicate that the problem is the charset.
> it doesn't matter what mode tika is used in (server, app mode, etc. even if content-type is specified with a charset, the output is still garbage).
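
For anyone following the report, the behavior the reporter expects (honor the charset declared in the HTML meta tag, then emit UTF-8) can be sketched roughly as below. This is only an illustration in Python, not Tika's detection code: `sniff_meta_charset`, `decode_html`, and the sample page are hypothetical names and data made up for the example, and the regex is a simplification of real meta-tag prescanning.

```python
import re

# Hypothetical helper, not Tika code: find a charset declaration inside a
# <meta ...> tag near the start of an HTML document.
META_CHARSET = re.compile(
    rb"""<meta[^>]+charset\s*=\s*["']?\s*([A-Za-z0-9_.:-]+)""",
    re.IGNORECASE,
)


def sniff_meta_charset(raw, limit=8192):
    """Return the charset named in a meta tag, or None if none is found."""
    match = META_CHARSET.search(raw[:limit])
    return match.group(1).decode("ascii") if match else None


def decode_html(raw, fallback="utf-8"):
    """Decode raw HTML bytes using the meta-declared charset when present."""
    charset = sniff_meta_charset(raw) or fallback
    try:
        return raw.decode(charset, errors="replace")
    except LookupError:
        # The declared charset name is unknown to Python; use the fallback.
        return raw.decode(fallback, errors="replace")


# Made-up sample resembling the reported case: Arabic text stored as
# ISO-8859-6 with a matching http-equiv meta tag.
arabic = "مرحبا"
page = (
    b'<html><head><meta http-equiv="Content-Type" '
    b'content="text/html; charset=iso-8859-6"></head><body>'
    + arabic.encode("iso-8859-6")
    + b"</body></html>"
)

text = decode_html(page)           # str containing the Arabic text
utf8_bytes = text.encode("utf-8")  # the output re-encoded as UTF-8
```

If the declared charset is ignored and the same bytes are decoded as, say, ISO-8859-1, the Arabic letters come out as unrelated Latin characters, which is consistent with the "garbage" output described in the report.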