[
https://issues.apache.org/jira/browse/PDFBOX-1017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13045825#comment-13045825
]
Thomas Fischer commented on PDFBOX-1017:
----------------------------------------
>> 1. Is this a standard behaviour for PDF files created with OpenOffice or
>> NeoOffice with usage of ligatures activated?
> I don't know. I guess you have to ask the OO-people.
I have since found that this is the behaviour of the "Linux Libertine" font.
After installing that font, the ligatures were displayed correctly. So it is at
least some kind of standard.
>> 2. If this is the case, is there a way to "teach" PDFBox to transform these
>> ligatures to the respective two- or three-character resolutions in the same
>> way that PDFBox resolves the TeX-created ligatures fi, fl etc.?
> In the given case you are able to provide a suitable mapping, because you as
> a human can simply compare the given pdf and the extracted text. But without
> a standard mapping it is impossible to implement a piece of software that
> does not have to be adjusted every time you find a new individual
> mapping.
It seems to me that situations like this are not uncommon: when analysing a
certain corpus of documents, I find non-standard characters, ligatures,
symbols… that are common to all or at least many of the documents.
It would be very helpful if I could create a table (say, of the complexity of
"additional_glyphlist.txt") that tells PDFBox (probably after recompiling)
how to translate these characters or symbols into other Unicode
characters. I suppose that an option for this sort of fine-tuning of PDFBox
would be a feature request.
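
To make the idea concrete, here is a rough sketch of what such a table could look like if it were applied as a post-processing step to already extracted text. It is not part of PDFBox and does not use the PDFBox API. The class name `LigatureTable`, the "hex code point;replacement" line format and the code points E000/E001 are all assumptions made for illustration; the real code points would have to be read from the font or found by comparing the PDF with the extracted text. It needs Java 8 or later.

```java
import java.util.HashMap;
import java.util.Map;

public class LigatureTable {

    // Parses lines of the (hypothetical) form "E000;ch":
    // hex code point, semicolon, replacement text.
    // Empty lines and lines starting with '#' are ignored.
    public static Map<Integer, String> parse(String tableText) {
        Map<Integer, String> table = new HashMap<>();
        for (String line : tableText.split("\\R")) {
            line = line.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue;
            }
            String[] parts = line.split(";", 2);
            if (parts.length < 2) {
                continue; // skip malformed lines
            }
            table.put(Integer.parseInt(parts[0].trim(), 16), parts[1]);
        }
        return table;
    }

    // Replaces every code point that has a table entry with its replacement
    // and copies all other code points unchanged.
    public static String expand(String text, Map<Integer, String> table) {
        StringBuilder out = new StringBuilder(text.length());
        text.codePoints().forEach(cp -> {
            String replacement = table.get(cp);
            if (replacement != null) {
                out.append(replacement);
            } else {
                out.appendCodePoint(cp);
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        // Assumed code points, for illustration only.
        Map<Integer, String> table = parse("# demo table\nE000;ch\nE001;ck\n");
        System.out.println(expand("rabbinis\uE000en", table));
    }
}
```

A table like this would have to be maintained per font or per corpus, which is exactly the adjustment effort mentioned in the quoted reply; the sketch only moves that effort out of the source code and into a text file.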
> Some Ligatures in a PDF file are not recognised.
> ------------------------------------------------
>
> Key: PDFBOX-1017
> URL: https://issues.apache.org/jira/browse/PDFBOX-1017
> Project: PDFBox
> Issue Type: Improvement
> Components: Text extraction
> Affects Versions: 1.6.0
> Environment: Mac OS X 10.6.7, java version "1.6.0_24"
> Reporter: Thomas Fischer
> Labels: textExtraction
> Attachments: Ligatures.pdf, Ligatures.txt
>
>
> In the attached file, some ligatures (Qu, Th, ch, ck, fft, ft, tt) are not
> transformed but remain in the text as Unicode characters in the private use
> range U+E0xx: "...im rabbinisen Sritum in untersiedlien Kontexten und
> dort,..."
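
As a first step towards building a mapping for a corpus, the untranslated characters described in the issue can be listed by scanning the extracted text for code points in the Private Use Area of the Basic Multilingual Plane (U+E000 to U+F8FF). The following is a minimal sketch that works on a plain string and does not call PDFBox; the class name `PrivateUseScan` and the code points in the sample string are hypothetical. It needs Java 8 or later.

```java
import java.util.Map;
import java.util.TreeMap;

public class PrivateUseScan {

    // Counts how often each Private Use Area code point (U+E000..U+F8FF)
    // occurs in the given text. The result is sorted by code point.
    public static Map<Integer, Integer> count(String text) {
        Map<Integer, Integer> counts = new TreeMap<>();
        text.codePoints()
            .filter(cp -> cp >= 0xE000 && cp <= 0xF8FF)
            .forEach(cp -> counts.merge(cp, 1, Integer::sum));
        return counts;
    }

    public static void main(String[] args) {
        // Sample text with assumed private-use code points, for illustration only.
        String sample = "rabbinis\uE000en S\uE000ri\uE001tum";
        count(sample).forEach((cp, n) ->
            System.out.printf("U+%04X occurs %d time(s)%n", cp, n));
    }
}
```

The resulting list of code points still has to be matched to ligatures by a human reader who compares the PDF with the extracted text.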