[
https://issues.apache.org/jira/browse/PDFBOX-2548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
John Hewson resolved PDFBOX-2548.
---------------------------------
Resolution: Not a Problem
> Problems with character extraction (fi ligature)
> ------------------------------------------------
>
> Key: PDFBOX-2548
> URL: https://issues.apache.org/jira/browse/PDFBOX-2548
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 1.8.7
> Environment: Windows7Professional JavaSE8 EclipseKepler
> Reporter: Matthias Bösinger
> Priority: Minor
> Attachments: preflight.png, test.pdf, test2.pdf
>
>
> I have a PDF document whose font type is OpenType (Garamond OpenType). As a
> result, the PDFBox text extraction can also extract special characters (for
> example small capital letters), which caused problems when the underlying
> font was a simple Type1 font.
> However, the text extraction now causes another type of problem. In my case,
> when the character sequences "fi" or "fl" occur in the text,
> PDFTextStripper#getText(PDDocument doc) extracts each of them as a single
> ligature character, 'ﬁ' (U+FB01) or 'ﬂ' (U+FB02), and places a space
> character on its right side.
> (Surprisingly, if I access the list of characters of a page via the
> charactersByArticle field of PDFTextStripper / via the
> PDFTextStripper#processText(TextPosition pos) method, the same characters
> show up as 'normal-single' characters f i / f l).
> My assumption is that the advantage of the underlying OpenType font turns
> into this particular disadvantage, because the PDFTextStripper recognizes the
> character sequences f i / f l as the special characters ﬁ / ﬂ (which might
> be related to the fact that the getText() method derives things like
> whitespace characters from distances / positional placements).
> Background: The given document is a dictionary with very densely printed
> text.
> See this link for code and output:
> http://stackoverflow.com/questions/27333499/problems-with-extracting-opentypefont-text-using-pdfbox
> My question: is there anything that I can do to avoid this problem?
> Thanks in advance ...
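
For readers hitting the same symptom, the following is a minimal sketch of one
possible post-processing workaround. It is not taken from the issue or from
PDFBox itself. It assumes the string returned by getText() contains the Unicode
ligature code points U+FB00 to U+FB06 and expands them with the JDK's
java.text.Normalizer. The class name LigatureExpander and the method name
expandLigatures are hypothetical, and the sketch does not address the extra
space reported after the ligature.

```java
import java.text.Normalizer;

// Hypothetical helper, not part of PDFBox: expands Latin ligature code points
// found in already-extracted text into their component letters.
public class LigatureExpander {

    // Expands the Latin ligatures in U+FB00..U+FB06 (for example U+FB01 "fi"
    // and U+FB02 "fl") and leaves every other character untouched.
    public static String expandLigatures(String text) {
        StringBuilder out = new StringBuilder(text.length() + 8);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c >= '\uFB00' && c <= '\uFB06') {
                // Compatibility decomposition maps a ligature to plain letters.
                out.append(Normalizer.normalize(String.valueOf(c), Normalizer.Form.NFKD));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Stand-in for text returned by PDFTextStripper#getText(PDDocument).
        String extracted = "\uFB01nal re\uFB02ex";
        System.out.println(expandLigatures(extracted)); // prints "final reflex"
    }
}
```

Restricting the normalization to the ligature range, rather than normalizing
the whole string, avoids changing other compatibility characters that the
extraction may have produced intentionally.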
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)