[ https://issues.apache.org/jira/browse/TIKA-2054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Angela Onslow updated TIKA-2054:
--------------------------------
Attachment: 2482_2014_DAVIDE+CAMPARI-MILANO+SPA_SUSTY-AR.pdf
Here is a file which demonstrates this problem.
> Problem with ligatures converting from PDF to HTML with Tika
> ------------------------------------------------------------
>
> Key: TIKA-2054
> URL: https://issues.apache.org/jira/browse/TIKA-2054
> Project: Tika
> Issue Type: Bug
> Affects Versions: 1.11, 1.13
> Reporter: Angela Onslow
> Attachments: 2482_2014_DAVIDE+CAMPARI-MILANO+SPA_SUSTY-AR.pdf
>
>
> When converting certain PDFs to HTML, I am having trouble with ligature
> characters being displayed as U+FFFD � REPLACEMENT CHARACTER.
> I have tried Apache Tika 1.11 and 1.13, converting on the command line
> using the .jar, and get the same results with both.
> If I use pdfbox-app-2.0.1.jar and 'ExtractText' with the icu4j-57_1.jar in
> the path and I convert to text rather than HTML then I am able to at least
> preserve information about what each ligature was originally, even if they
> are still represented as unprintable characters.
> I.e. if I run the following from the command line:
> java -jar pdfbox-app-1.8.12.jar ExtractText 'test.pdf' 'test.txt'
> Then the resulting test.txt when viewed in Sublime2 has "fi" represented as
> the US (unit separator character), "ff" represented as RS, "fl" represented
> as GS and "ffl" represented as FS, which I could then replace with the
> appropriate characters.
> I was under the impression Tika uses icu4j. Is there a way to get the same
> behaviour I see with PDFBox with Tika when converting from PDF to HTML?
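The replacement step mentioned in the description could look roughly like the sketch below. It assumes the control-character mapping reported above (US for "fi", RS for "ff", GS for "fl", FS for "ffl") holds for this particular PDF; other documents may use a different mapping. The class and method names (LigatureFixer, restore) are hypothetical and are not part of Tika or PDFBox.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical post-processing helper; not part of Tika or PDFBox.
final class LigatureFixer {

    // Mapping taken from the behaviour reported for this one PDF.
    // It is an assumption that the same mapping applies elsewhere.
    private static final Map<Character, String> LIGATURES = new HashMap<>();
    static {
        LIGATURES.put((char) 0x1F, "fi");  // US, unit separator
        LIGATURES.put((char) 0x1E, "ff");  // RS, record separator
        LIGATURES.put((char) 0x1D, "fl");  // GS, group separator
        LIGATURES.put((char) 0x1C, "ffl"); // FS, file separator
    }

    private LigatureFixer() {
    }

    // Returns a copy of the text with each mapped control character
    // replaced by the letters of the ligature it stood for.
    static String restore(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            String replacement = LIGATURES.get(c);
            if (replacement != null) {
                out.append(replacement);
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

For example, LigatureFixer.restore applied to the contents of test.txt would turn "of" + US + "ce" back into "office". This only works on the ExtractText output, where the four ligatures remain distinguishable; once they have all been collapsed to U+FFFD in the HTML output, the original letters cannot be recovered this way.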
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)