tom hill created TIKA-3858:
------------------------------
Summary: Ligatures are not converted on text extraction
Key: TIKA-3858
URL: https://issues.apache.org/jira/browse/TIKA-3858
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 1.5
Environment: Windows 8, JRE 1.5
Reporter: tom hill
Fix For: 1.7

From reviewing the Tika sources, I see that Tika uses PDFBox to parse PDF files.
I found that PDFBox itself uses ICU4J to handle ligatures.
Unfortunately, when I added the ICU4J jar to my classpath, nothing changed:
ligatures are still not converted. A sample PDF file is attached.
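
To illustrate the conversion I expect, here is a minimal sketch that expands ligatures in already-extracted text using only the JDK's java.text.Normalizer. This is an assumption about a possible post-processing workaround, not how Tika or PDFBox handle ligatures internally; the class LigatureNormalizer and the method expandLigatures are hypothetical names used for illustration.

```java
import java.text.Normalizer;

// Hypothetical helper; not part of Tika, PDFBox, or ICU4J.
final class LigatureNormalizer {

    private LigatureNormalizer() {
    }

    // Applies Unicode compatibility normalization (NFKC) to extracted text.
    // Latin ligature code points such as U+FB01 (fi) and U+FB03 (ffi)
    // have compatibility decompositions, so they become plain letters.
    static String expandLigatures(String extracted) {
        if (extracted == null) {
            return null;
        }
        return Normalizer.normalize(extracted, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // "e" + U+FB03 + "cient " + U+FB01 + "le"
        String sample = "e\uFB03cient \uFB01le";
        System.out.println(expandLigatures(sample)); // prints "efficient file"
    }
}
```

Note that NFKC also rewrites other compatibility characters (for example full-width forms and superscript digits), so this may change more than just ligatures.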
--
This message was sent by Atlassian Jira
(v8.20.10#820010)