[ 
https://issues.apache.org/jira/browse/PDFBOX-970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Fischer updated PDFBOX-970:
----------------------------------

    Attachment: Test2.pdf
                Test2-1.6.txt
                Test2.1.4.txt

I put icu-4.0.1.jar on my classpath, and that essentially resolved the umlaut 
issue: umlauts are now represented as combined characters (I'm not quite sure 
what search engines do with those). However, pdfbox 1.4 didn't need the 
additional ICU jar. Was this dependency introduced in a recent version?

Unfortunately there are still some strange problems with the conversion, 
mainly missing characters. I am uploading a new test file and its conversions 
using pdfbox 1.4 and 1.6 respectively; comparing them shows the errors (and 
some additional differences).
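
On the question of what to do with combined characters before handing the 
text to a search engine: a minimal sketch, assuming the extracted umlauts are 
a base letter followed by U+0308 (COMBINING DIAERESIS). That assumption has 
not been checked against the attached output. The sketch uses only 
java.text.Normalizer from the JDK, not any PDFBox API, and the class name 
UmlautNormalizer is made up for illustration.

```java
import java.text.Normalizer;

// Hypothetical helper, not part of PDFBox: post-processes extracted text.
final class UmlautNormalizer {

    // Compose base letter + combining mark into a single precomposed
    // character where one exists (Unicode Normalization Form C).
    static String compose(String extracted) {
        return Normalizer.normalize(extracted, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        // 'a' followed by U+0308, as assumed for the extracted text.
        String decomposed = "Universita\u0308t";
        String composed = compose(decomposed);

        // Print code unit counts before and after composition.
        System.out.println(decomposed.length() + " -> " + composed.length());
    }
}
```

This can only help with characters that are present in decomposed form. It 
cannot restore characters that are missing from the output altogether.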

> TeX-created ligatures and umlauts are not recognised
> ----------------------------------------------------
>
>                 Key: PDFBOX-970
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-970
>             Project: PDFBox
>          Issue Type: Bug
>          Components: FontBox
>    Affects Versions: 1.5.0
>         Environment: Mac OS X 10.6.6, Java(TM) SE Runtime Environment (build 
> 1.6.0_22-b04-307-10M3261)
>            Reporter: Thomas Fischer
>              Labels: textExtraction
>         Attachments: A Python Library for Provenance Recording and 
> Querying.txt, A Python Library for Provenance Recording and Querying.txt, 
> Test.pdf, Test.pdf, Test2-1.6.txt, Test2.1.4.txt, Test2.pdf
>
>
> Ligatures in a TeX-created document are lost in v. 1.5, although v. 1.4 
> recognised them, e.g.
>   1.4          1.5
> official      ocial
> effort        e ort
> fields        elds
> first         rst
> In addition, German umlauts (ä, ö, ü) are represented as ( a,  o,  u), 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
