Karl Wright created CONNECTORS-1008:
---------------------------------------

             Summary: Tika Extractor doesn't seem to handle special characters correctly
                 Key: CONNECTORS-1008
                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1008
             Project: ManifoldCF
          Issue Type: Bug
    Affects Versions: ManifoldCF 1.7
            Reporter: Karl Wright
            Assignee: Karl Wright
             Fix For: ManifoldCF 1.7


The Tika extractor, when extracting content from a PDF (specifically, the en_US 
end-user-documentation PDF), does not handle anything other than Latin-1 
characters properly.  For example, it does not convert the copyright symbol, or 
any CJK characters, to UTF-8.
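
The report does not state a root cause. As a hedged illustration only, the sketch below shows one common way this kind of symptom can arise: text is written out with a single-byte charset such as ISO-8859-1 while the consumer reads the bytes as UTF-8. The class and method names (EncodingSketch, writeThenReadAsUtf8, viaLatin1, viaUtf8) are hypothetical and are not ManifoldCF or Tika code; only standard JDK calls are used.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; this is not ManifoldCF or Tika code.
class EncodingSketch {

    // Encode the text with the writer's charset, then decode the bytes as
    // UTF-8, the way a downstream consumer expecting UTF-8 would.
    static String writeThenReadAsUtf8(String text, Charset writerCharset) {
        byte[] bytes = text.getBytes(writerCharset);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Assumed faulty path: the writer uses ISO-8859-1 (Latin-1).
    static String viaLatin1(String text) {
        return writeThenReadAsUtf8(text, StandardCharsets.ISO_8859_1);
    }

    // Expected path: the writer uses UTF-8 end to end.
    static String viaUtf8(String text) {
        return writeThenReadAsUtf8(text, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Copyright sign, a space, and two CJK characters, written as
        // Unicode escapes so the source file encoding does not matter.
        String sample = "\u00a9 \u4e2d\u6587";

        System.out.println(sample.equals(viaUtf8(sample)));   // prints "true"
        System.out.println(sample.equals(viaLatin1(sample))); // prints "false"
    }
}
```

In this sketch the CJK characters cannot be represented in ISO-8859-1 at all and are replaced with '?' at encoding time, while the copyright sign becomes the single byte 0xA9, which is not valid UTF-8 on its own and decodes to the replacement character U+FFFD. Whether the extractor actually fails this way is an assumption, not something the report confirms.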




--
This message was sent by Atlassian JIRA
(v6.2#6252)
