[ 
https://issues.apache.org/jira/browse/PDFBOX-2409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Hewson updated PDFBOX-2409:
--------------------------------
    Attachment: adobe-utf8.txt

Your golden sample isn't correct: U+0020 is a space character, so you've got 
the diacritics placed over spaces rather than over the characters they should 
combine with. I've attached the text of the title, which I extracted with Adobe 
Acrobat as UTF-8.
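
One way to check a reference file for this problem is to count the combining 
marks that directly follow a space. The following is only a sketch: the class 
and method names are made up for illustration, the sample strings are 
synthetic rather than taken from the attachments, and it only looks at 
non-spacing marks (Unicode category Mn), which is the category of the Arabic 
short-vowel marks.

```java
// Hypothetical helper, not part of PDFBox.
final class SpaceDiacriticCheck {

    // Counts non-spacing marks (category Mn) that immediately follow U+0020.
    static int countMarksAfterSpace(String text) {
        int count = 0;
        int previous = -1; // no previous code point yet
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (previous == 0x20
                    && Character.getType(cp) == Character.NON_SPACING_MARK) {
                count++;
            }
            previous = cp;
            i += Character.charCount(cp);
        }
        return count;
    }

    public static void main(String[] args) {
        // Synthetic samples: FATHA (U+064E) after a space vs. after SEEN (U+0633).
        System.out.println(countMarksAfterSpace("\u0020\u064E\u0633"));
        System.out.println(countMarksAfterSpace("\u0633\u064E"));
    }
}
```

A non-zero count for a reference file suggests that its diacritics are 
attached to spaces rather than to letters.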

Here's a comparison:

- Acrobat:
h1. الرِّسَالَةُ الأُولَى إِلَى مُؤْمِنِي تَسَالُونِيكِي 

- PDFBox:
h1. الرَِّساَلُة اُلأوَلى ِإَلى ُمْؤِمِني َتَساُلوِنيِكي

It seems that PDFBox has moved many of the diacritics one character to the 
right. The first misplaced diacritic is on the letter س: looking at the 
Unicode code points, we see that PDFBox outputs the diacritic before the 
letter it applies to, which is wrong because a combining mark must follow its 
base character. It looks like this happens when the preceding letter has 
diacritics too, and those diacritics appear to have been switched.
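
To make this kind of comparison reproducible, the two extractions can be 
dumped as code points and the first difference located. This is a minimal 
sketch with hypothetical names (CodePointDiff, dump, firstMismatch), and the 
strings in main are short synthetic samples, not the contents of the attached 
files.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of PDFBox.
final class CodePointDiff {

    // Renders each code point as U+XXXX so the logical order is visible.
    static List<String> dump(String text) {
        List<String> out = new ArrayList<>();
        text.codePoints().forEach(cp -> out.add(String.format("U+%04X", cp)));
        return out;
    }

    // Index (in code points) of the first difference, or -1 if identical.
    static int firstMismatch(String expected, String actual) {
        int[] a = expected.codePoints().toArray();
        int[] b = actual.codePoints().toArray();
        int common = Math.min(a.length, b.length);
        for (int i = 0; i < common; i++) {
            if (a[i] != b[i]) {
                return i;
            }
        }
        return a.length == b.length ? -1 : common;
    }

    public static void main(String[] args) {
        String expected = "\u0633\u064E\u0627"; // SEEN, FATHA, ALEF
        String actual = "\u064E\u0633\u0627";   // FATHA placed before SEEN
        System.out.println(dump(expected));
        System.out.println(dump(actual));
        System.out.println(firstMismatch(expected, actual));
    }
}
```

Comparing code points rather than rendered glyphs avoids being misled by how a 
particular font or viewer positions the marks.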

Conclusion: PDFBox is mangling the diacritics somehow.

> got the wrong result from Arabic text extraction
> ------------------------------------------------
>
>                 Key: PDFBOX-2409
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2409
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.8.7, 2.0.0
>         Environment: Ubuntu 14.04 64bit
> java version "1.8.0_20"
>            Reporter: EugenePig
>            Assignee: John Hewson
>         Attachments: THESSALONIANS.line - golden.txt, THESSALONIANS.pdf, 
> THESSALONIANS.txt, THESSALONIANS_win7_firefox.jpg, TextEdit-Arial.png, 
> adobe-utf8.txt, jahewson.mac.png
>
>
> java -jar pdfbox-app-1.8.7.jar ExtractText -sort -encoding UTF-8 
> THESSALONIANS.pdf
> java -jar pdfbox-app-2.0.0-SNAPSHOT.jar ExtractText -sort -encoding UTF-8 
> THESSALONIANS.pdf
> Please compare THESSALONIANS.txt.jpg with THESSALONIANS.pdf. There are a lot 
> of differences. I just marked a few differences with red circles.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
