[ 
https://issues.apache.org/jira/browse/PDFBOX-2409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14189158#comment-14189158
 ] 

John Hewson commented on PDFBOX-2409:
-------------------------------------

{quote}
In our sample, U+FC62 and U+FEAE have the same position. They are overlapped. 
Therefore I know U+FC62 is a case of spacing diacritic marks.
{quote}

But it's not a spacing diacritic mark, because the diacritic is not being 
applied to a space. A spacing diacritic mark occurs, as the text you posted 
says, "when a combining diacritic mark is applied to a space character". What 
you actually have is U+FC62, the "isolated form" of "SHADDA WITH KASRA", which 
is a non-joining character. This particular Unicode character is not a 
combining diacritic.
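
The classification can be checked with the JDK alone. The following is a 
minimal sketch, not PDFBox code: the class and method names are made up for 
illustration, and the names printed depend on the Unicode data shipped with 
the JDK in use. It treats a code point as a combining mark when its general 
category is Mn, Mc or Me.

```java
class CodePointInfo {
    /** True if the code point's general category is a mark (Mn, Mc or Me). */
    static boolean isCombiningMark(int cp) {
        int type = Character.getType(cp);
        return type == Character.NON_SPACING_MARK
            || type == Character.COMBINING_SPACING_MARK
            || type == Character.ENCLOSING_MARK;
    }

    /** One-line description: code point, Unicode name, combining flag. */
    static String describe(int cp) {
        return String.format("U+%04X %s combining=%b",
                cp, Character.getName(cp), isCombiningMark(cp));
    }

    public static void main(String[] args) {
        System.out.println(describe(0xFC62)); // the character found in the PDF
        System.out.println(describe(0xFEAE)); // the letter it overlaps
        System.out.println(describe(0x0651)); // ARABIC SHADDA, a combining mark
        System.out.println(describe(0x0650)); // ARABIC KASRA, a combining mark
    }
}
```

U+0651 and U+0650 are the combining marks one would expect in the text; 
U+FC62 is a presentation-form letter and is reported as non-combining.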

However, I do see the problem with those characters in the title, and I 
understand that you expected to get U+FEAE and some combining diacritics 
instead of U+FC62. Unfortunately, that's not what's in the PDF file. It's very 
common to encounter files where the embedded Unicode text doesn't quite match 
what you see on the screen, because the software which generated the PDF did 
not write it correctly. This is where comparing the output of PDFBox with 
Acrobat can be helpful, to determine whether the PDF file itself is the 
problem. In this case it is: PDFBox outputs the same text for the title as 
Acrobat does. Sadly, that means that I can't fix the problem.
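
When comparing the output of two tools, it helps to look at code points rather 
than rendered glyphs, since fonts and shaping can hide differences. A rough 
sketch, assuming both outputs have already been read into strings (the class 
and method names are hypothetical, and it needs Java 8 for 
String.codePoints()):

```java
class CodePointDiff {
    /** Formats a string as space-separated U+XXXX code points. */
    static String toCodePoints(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    /** Index (counted in code points) of the first difference, or -1 if equal. */
    static int firstDifference(String a, String b) {
        int[] x = a.codePoints().toArray();
        int[] y = b.codePoints().toArray();
        int n = Math.min(x.length, y.length);
        for (int i = 0; i < n; i++) {
            if (x[i] != y[i]) {
                return i;
            }
        }
        // Same prefix: equal only if the lengths also match.
        return x.length == y.length ? -1 : n;
    }
}
```

If firstDifference returns -1 for the same line taken from the PDFBox output 
and from the Acrobat output, both tools read the same text from the file, 
which points at the PDF rather than at the extractor.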

> got the wrong result from Arabic text extraction
> ------------------------------------------------
>
>                 Key: PDFBOX-2409
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2409
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.8.7, 2.0.0
>         Environment: Ubuntu 14.04 64bit
> java version "1.8.0_20"
>            Reporter: EugenePig
>             Fix For: 2.0.0
>
>         Attachments: THESSALONIANS-Commit-1634256.jpg, 
> THESSALONIANS-Commit-1634256.txt, THESSALONIANS-UTF16-Commit-1634256.txt, 
> THESSALONIANS.Sample.jpg, THESSALONIANS.Sample.txt, THESSALONIANS.line - 
> golden.txt, THESSALONIANS.pdf, THESSALONIANS.txt, THESSALONIANS.xml, 
> THESSALONIANS_win7_firefox.jpg, TextEdit-Arial.png, adobe-utf8.txt, 
> jahewson.mac.png
>
>
> java -jar pdfbox-app-1.8.7.jar ExtractText -sort -encoding UTF-8 
> THESSALONIANS.pdf
> java -jar pdfbox-app-2.0.0-SNAPSHOT.jar ExtractText -sort -encoding UTF-8 
> THESSALONIANS.pdf
> Please compare THESSALONIANS.txt.jpg with THESSALONIANS.pdf. There are a lot 
> of differences. I just marked a few differences with red circles.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)