Karl Wright created CONNECTORS-1008:
---------------------------------------
Summary: Tika Extractor doesn't seem to handle special characters correctly
Key: CONNECTORS-1008
URL: https://issues.apache.org/jira/browse/CONNECTORS-1008
Project: ManifoldCF
Issue Type: Bug
Affects Versions: ManifoldCF 1.7
Reporter: Karl Wright
Assignee: Karl Wright
Fix For: ManifoldCF 1.7
The Tika extractor, when extracting content from a PDF (specifically, the en_US
end-user-documentation PDF), does not properly handle characters outside
Latin-1. For example, it does not convert the copyright symbol, or any CJK
characters, to UTF-8.
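
The report does not state a root cause, so the following is only a minimal
sketch of the kind of charset mismatch that produces this symptom. It uses
only the JDK, not the Tika API, and the class and method names
(EncodingCheck, roundTripLatin1, roundTripUtf8, utf8ReadAsLatin1) are
hypothetical names chosen for illustration. It shows that a string containing
the copyright sign (U+00A9) and CJK characters survives a UTF-8 round trip,
loses the CJK characters when forced through Latin-1, and turns the copyright
sign into two characters when its UTF-8 bytes are decoded as Latin-1.

```java
import java.nio.charset.StandardCharsets;

public class EncodingCheck {

    // Encode as ISO-8859-1 and decode again. Characters that Latin-1
    // cannot represent (such as CJK) are replaced with '?' on encoding.
    public static String roundTripLatin1(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.ISO_8859_1);
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    // Encode as UTF-8 and decode again. This round trip is lossless.
    public static String roundTripUtf8(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Encode as UTF-8 but decode as ISO-8859-1, a common mismatch
    // between the producer and the consumer of a byte stream.
    public static String utf8ReadAsLatin1(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // Copyright sign, a space, then three CJK characters.
        String sample = "\u00A9 \u65E5\u672C\u8A9E";

        // Only booleans and integers are printed, so the result does not
        // depend on the console's own encoding.
        System.out.println("UTF-8 round trip intact: "
                + sample.equals(roundTripUtf8(sample)));
        System.out.println("Latin-1 round trip intact: "
                + sample.equals(roundTripLatin1(sample)));
        System.out.println("Copyright sign length after mismatch: "
                + utf8ReadAsLatin1("\u00A9").length());
    }
}
```

A check of this kind, run against the text the extractor actually emits, could
help narrow down whether characters are being lost on encoding or garbled on
decoding; which of the two applies here is not something the report settles.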