[
https://issues.apache.org/jira/browse/PDFBOX-755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
John Hewson resolved PDFBOX-755.
--------------------------------
Resolution: Not a Problem
Update: Passing {{-encoding "UTF-8"}} to ExtractText gets me the combined
characters as expected:
{code}
S. KALABUŠIĆ AND M. R. S. KULENOVIĆ
{code}
The same can be done programmatically via:
{code}
new PDFTextStripper("UTF-8")
{code}
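For reference, the difference between the two outputs quoted in the issue description comes down to spacing accents placed before the letter versus combining marks placed after it. The following is only a rough sketch using the JDK's {{java.text.Normalizer}}, not PDFBox code; the class name {{CombiningMarks}} and its helper methods are made up for illustration. It assumes the code points from the description (U+00B4 / U+02C7 for the spacing accents, U+0301 / U+030C for the combining marks):

```java
import java.text.Normalizer;

public class CombiningMarks {

    // NFC composes a base letter followed by a combining mark into a
    // single precomposed character when Unicode defines one.
    static String compose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    // Renders each code point of the string as U+XXXX, separated by spaces.
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // Combining caron / combining acute AFTER the base letter.
        String combining = "KALABU" + "S\u030C" + "I" + "C\u0301";
        // Spacing caron / spacing acute BEFORE the base letter, as in the report.
        String spacing = "KALABU\u02C7SI \u00B4C";

        // The combining sequences compose to the precomposed letters.
        System.out.println(codePoints(compose("S\u030C"))); // U+0160
        System.out.println(codePoints(compose("C\u0301"))); // U+0106
        System.out.println(compose(combining).length());    // 9

        // The spacing accents are standalone characters, so NFC leaves
        // the string exactly as it was.
        System.out.println(compose(spacing).equals(spacing)); // true
    }
}
```

So a spacing accent emitted before the letter cannot be repaired by normalization afterwards, whereas a combining mark after the letter can.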
> Wrong translation of capital letters with combining diacritics
> --------------------------------------------------------------
>
> Key: PDFBOX-755
> URL: https://issues.apache.org/jira/browse/PDFBOX-755
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 1.2.0
> Environment: Mac OS X 10.6.4
> Reporter: Thomas Fischer
> Attachments: 139-p.1+3.pdf, 139-p.1+3.txt
>
>
> S. KALABUˇSI ´C ANDM. R. S. KULENOVI ´C
> vs.
> S. KALABUŠIĆ AND M. R. S. KULENOVIĆ
> 1. ´ before vs. ́ behind the letter (\x20 \xB4 vs. \x301)
> 2. ˇ before vs. ̌ behind the letter (\x2C7 vs. \x30C)
> 3. ANDM. : space missing
> Note:
> S. Kalabušić is translated correctly