[
https://issues.apache.org/jira/browse/TIKA-2054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15418755#comment-15418755
]
Tim Allison commented on TIKA-2054:
-----------------------------------
I don't think we want to modify our SafeContentHandler to stop converting
control characters.
This is difficult. If I understand correctly, PDFBox is warning that the
fonts have no Unicode mapping for the ligature glyphs:
{noformat}
Aug 12, 2016 8:03:21 AM org.apache.pdfbox.pdmodel.font.PDSimpleFont toUnicode
WARNING: No Unicode mapping for f_i (31) in font XOILAG+MyriadPro-Bold
Aug 12, 2016 8:03:21 AM org.apache.pdfbox.pdmodel.font.PDSimpleFont toUnicode
WARNING: No Unicode mapping for f_i (31) in font XOILAG+MyriadPro-Regular
Aug 12, 2016 8:03:21 AM org.apache.pdfbox.pdmodel.font.PDSimpleFont toUnicode
WARNING: No Unicode mapping for f_f (30) in font XOILAG+MyriadPro-Regular
Aug 12, 2016 8:03:21 AM org.apache.pdfbox.pdmodel.font.PDSimpleFont toUnicode
WARNING: No Unicode mapping for f_l (29) in font XOILAG+MyriadPro-Regular
Aug 12, 2016 8:03:22 AM org.apache.pdfbox.pdmodel.font.PDSimpleFont toUnicode
WARNING: No Unicode mapping for f_f_i (28) in font XOILAG+MyriadPro-Regular
{noformat}
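So "fi" is being mapped to 0x1f (31), "ff" to 0x1e (30), "fl" to 0x1d (29)
and "ffi" to 0x1c (28), and, as you point out, you can recover these by
applying a custom mapping to the output of PDFBox. Tika, via its
SafeContentHandler, converts most chars < 0x20 to '\ufffd', so by the time
the text comes out of Tika you can no longer tell which ligature was which.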
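
A minimal sketch of the custom mapping described above, assuming the four
codes from the PDFBox warnings (28 to 31) hold for this particular PDF only;
other fonts may use different codes. The class name LigatureFix and both
methods are hypothetical. sanitize() is only a simplified stand-in for the
"most chars < 0x20 become U+FFFD" behaviour and is not Tika's actual code:

```java
import java.util.HashMap;
import java.util.Map;

public class LigatureFix {

    // Assumed mapping, taken only from the PDFBox warnings for this one PDF.
    private static final Map<Character, String> LIGATURES = new HashMap<>();
    static {
        LIGATURES.put((char) 0x1F, "fi");  // f_i   (31)
        LIGATURES.put((char) 0x1E, "ff");  // f_f   (30)
        LIGATURES.put((char) 0x1D, "fl");  // f_l   (29)
        LIGATURES.put((char) 0x1C, "ffi"); // f_f_i (28)
    }

    // Replace the known control codes with the letters they stand for.
    public static String restore(String text) {
        StringBuilder sb = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            String letters = LIGATURES.get(c);
            sb.append(letters != null ? letters : String.valueOf(c));
        }
        return sb.toString();
    }

    // Simplified stand-in for the sanitizing step: control characters other
    // than tab, LF and CR all collapse to the same replacement character.
    public static String sanitize(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean control = c < 0x20 && c != 0x09 && c != 0x0A && c != 0x0D;
            sb.append(control ? (char) 0xFFFD : c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String raw = "con" + (char) 0x1F + "dentiality";
        System.out.println(restore(raw));            // mapping applied first
        System.out.println(restore(sanitize(raw)));  // too late: code is gone
    }
}
```

The order matters: restore() has to see the raw PDFBox text. Once the
control characters have been collapsed to U+FFFD, 0x1c through 0x1f are
indistinguishable and no mapping can bring the ligatures back.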
Adobe Reader seems to do the same thing that PDFBox does, but Microsoft Edge is
able to correctly extract e.g. "confidentiality"...not sure how that is
happening?!
> Problem with ligatures converting from PDF to HTML with Tika
> ------------------------------------------------------------
>
> Key: TIKA-2054
> URL: https://issues.apache.org/jira/browse/TIKA-2054
> Project: Tika
> Issue Type: Bug
> Affects Versions: 1.11, 1.13
> Reporter: Angela O
> Attachments: 2482_2014_DAVIDE+CAMPARI-MILANO+SPA_SUSTY-AR.pdf
>
>
> When converting certain PDFs from PDF to HTML I am having trouble with
> ligature characters being displayed as U+FFFD � REPLACEMENT CHARACTER.
> I have tried using Apache Tika 1.11 and 1.13, converting on the command line
> using the .jar and get the same results.
> If I use pdfbox-app-2.0.1.jar and 'ExtractText' with the icu4j-57_1.jar in
> the path and I convert to text rather than HTML then I am able to at least
> preserve information about what each ligature was originally, even if they
> are still represented as unprintable characters.
> I.e. if I run the following from the command line:
> java -jar pdfbox-app-1.8.12.jar ExtractText 'test.pdf' 'test.txt'
> Then the resulting test.txt when viewed in Sublime2 has "fi" represented as
> the US (unit separator character), "ff" represented as RS, "fl" represented
> as GS and "ffl" represented as FS, which I could then replace with the
> appropriate characters.
> I was under the impression Tika uses icu4j, is there a way to get the same
> behaviour I see with PDFBox with Tika when converting from PDF to HTML?