[
https://issues.apache.org/jira/browse/PDFBOX-970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Thomas Fischer updated PDFBOX-970:
----------------------------------
Attachment: Test2.pdf
Test2-1.6.txt
Test2.1.4.txt
I put icu-4.0.1.jar into my classpath, and that essentially resolved the
umlaut issue: umlauts are now represented as combined characters (I'm not
quite sure what search engines do with those). However, pdfbox 1.4 didn't need
the additional ICU jar; was this requirement introduced in a recent version?
Unfortunately, there are still some strange problems with the conversion,
essentially missing characters. I have uploaded a new test file along with
conversions made using pdfbox 1.4 and 1.6 respectively; comparing them shows
the errors (and some additional differences).
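
For anyone worried about the combined characters mentioned above: a minimal
sketch of composing them after extraction, assuming the extracted text holds
a base letter followed by a combining mark (e.g. 'a' + U+0308). It uses only
java.text.Normalizer from the JDK; the class and method names are
hypothetical and not part of pdfbox.

```java
import java.text.Normalizer;

public class UmlautNormalizer {

    /**
     * Composes base letter + combining mark sequences into single
     * precomposed code points (Unicode normalization form NFC).
     */
    static String compose(String extracted) {
        return Normalizer.normalize(extracted, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        // Assumed shape of the extracted text: 'a' followed by
        // U+0308 COMBINING DIAERESIS instead of a precomposed a-umlaut.
        String decomposed = "Universita\u0308t";
        String composed = compose(decomposed);

        // The composed string is one char shorter than the decomposed one.
        System.out.println(decomposed.length() + " -> " + composed.length());
    }
}
```

Whether a given search engine applies this normalization itself is not
something the sketch can answer; composing before indexing simply avoids
depending on it.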
> TeX-created ligatures and umlauts are not recognised
> ----------------------------------------------------
>
> Key: PDFBOX-970
> URL: https://issues.apache.org/jira/browse/PDFBOX-970
> Project: PDFBox
> Issue Type: Bug
> Components: FontBox
> Affects Versions: 1.5.0
> Environment: Mac OS X 10.6.6, Java(TM) SE Runtime Environment (build
> 1.6.0_22-b04-307-10M3261)
> Reporter: Thomas Fischer
> Labels: textExtraction
> Attachments: A Python Library for Provenance Recording and
> Querying.txt, A Python Library for Provenance Recording and Querying.txt,
> Test.pdf, Test.pdf, Test2-1.6.txt, Test2.1.4.txt, Test2.pdf
>
>
> Ligatures in a TeX-created document are lost, although v. 1.4 recognised
> them, e.g.
>   1.4        1.5
>   official   ocial
>   effort     e ort
>   fields     elds
>   first      rst
> In addition, German umlauts (ä, ö, ü) are represented as ( a, o, u),
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira