[ 
https://issues.apache.org/jira/browse/PDFBOX-970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14034375#comment-14034375
 ] 

John Hewson edited comment on PDFBOX-970 at 6/17/14 9:18 PM:
-------------------------------------------------------------

-I'm not getting combined characters for the umlaut with 2.0 trunk-. 
Interestingly enough, Adobe Acrobat strips the umlaut and OSX Preview extracts 
it as "fu ̈r", so it's not clear that we really need to be trying to combine it.

Update: Passing {{-encoding "UTF-8"}} to ExtractText gets me the combined 
characters as expected.
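
For reference, here is a minimal sketch of what "combined" means above, assuming the extracted text holds a base letter immediately followed by U+0308 (COMBINING DIAERESIS). It uses only the JDK's java.text.Normalizer and is not how PDFBox itself does it; the class and method names are hypothetical.

```java
import java.text.Normalizer;

// Hypothetical helper for illustration only; not part of PDFBox.
class UmlautCompose {
    // NFC composes a base letter plus a combining mark into the
    // precomposed code point when one exists (u + U+0308 -> U+00FC).
    static String compose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        String decomposed = "fu\u0308r";        // 'u' followed by U+0308
        String composed = compose(decomposed);  // "f\u00fcr"
        System.out.println(decomposed.length() + " -> " + composed.length());
        System.out.println(composed.equals("f\u00fcr"));
    }
}
```

If a space sits between the letter and the mark, as it appears to in the Preview output, NFC leaves the sequence uncombined, so any such space would have to be dropped first.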


was (Author: jahewson):
I'm not getting combined characters for the umlaut with 2.0 trunk. 
Interestingly enough, Adobe Acrobat strips the umlaut and OSX Preview extracts 
it as "fu ̈r", so it's not clear that we really need to be trying to combine it.

> TeX-created ligatures and umlauts are not recognised
> ----------------------------------------------------
>
>                 Key: PDFBOX-970
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-970
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.5.0
>         Environment: Mac OS X 10.6.6, Java(TM) SE Runtime Environment (build 
> 1.6.0_22-b04-307-10M3261)
>            Reporter: Thomas Fischer
>              Labels: textExtraction
>         Attachments: A Python Library for Provenance Recording and 
> Querying.txt, A Python Library for Provenance Recording and Querying.txt, 
> Test.pdf, Test.pdf, Test2-1.6.txt, Test2.1.4.txt, Test2.pdf
>
>
> Ligatures in a TeX-created document that were recognised by v. 1.4 are lost, 
> e.g.
> 1.4           1.5
> official      ocial
> effort        e ort
> fields        elds
> first         rst
> In addition, German umlauts (ä, ö, ü) are represented as ( a,  o,  u), 



