[ https://issues.apache.org/jira/browse/PDFBOX-2409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
John Hewson updated PDFBOX-2409:
--------------------------------
    Attachment: TextEdit-Arial.png

I get the same rendering when I use TextEdit on my Mac with Arial as the font; I've attached a screenshot.

Looking at the text, I think the diacritics are being placed on the wrong character, one character too late (i.e. shifted to the left). What do you think?

This could well be a PDFBox text encoding issue with combining diacritics.

> got the wrong result from Arabic text extraction
> ------------------------------------------------
>
>                 Key: PDFBOX-2409
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2409
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.8.7, 2.0.0
>         Environment: Ubuntu 14.04 64bit
>                      java version "1.8.0_20"
>            Reporter: EugenePig
>            Assignee: John Hewson
>         Attachments: THESSALONIANS.pdf, THESSALONIANS.txt,
>                      THESSALONIANS_win7_firefox.jpg, TextEdit-Arial.png, jahewson.mac.png
>
>
> java -jar pdfbox-app-1.8.7.jar ExtractText -sort -encoding UTF-8 THESSALONIANS.pdf
> java -jar pdfbox-app-2.0.0-SNAPSHOT.jar ExtractText -sort -encoding UTF-8 THESSALONIANS.pdf
> Please compare THESSALONIANS.txt.jpg with THESSALONIANS.pdf. There are a lot
> of differences. I just marked a few differences with red circles.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)