[ 
https://issues.apache.org/jira/browse/PDFBOX-1424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13490003#comment-13490003
 ] 

Andreas Lehmkühler commented on PDFBOX-1424:
--------------------------------------------

Thanks for the test and the in-depth analysis. I double-checked the Unicode 
mapping provided in the PDF against the extracted result and came to the 
conclusion that, from the PDF's point of view, everything works as specified. 
I can't see any issue with PDFBox, and the fact that Adobe produces the same 
result confirms my assumption. The Unicode mapping within the PDF is simply 
wrong. I'm afraid you have to blame the tool you used to create the PDF or 
possibly the creator of the font.
                
> Wrong glyph (Persian) is used in extracted text instead of the original glyph 
> (Persian) in PDF file
> ---------------------------------------------------------------------------------------------------
>
>                 Key: PDFBOX-1424
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1424
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.7.1
>         Environment: Windows XP, Java 1.6.0
>            Reporter: Ali Majdzadeh Kohbanani
>            Assignee: Andreas Lehmkühler
>             Fix For: 1.8.0
>
>         Attachments: PDFBOX1424-persian_test.html, persian_test.html, 
> persian_test.pdf
>
>
> Hi
> I am very new to PDFBox and I am dealing with Persian PDF files. When I 
> convert Persian PDF files using PDFBox-app, some Persian glyphs like م are 
> displayed wrongly in the extracted text. For example, the word "هستم" in 
> Persian is extracted as "هستن" and "من" in Persian is extracted as "هن". 
> Also, the word "سلام" is extracted as "سالم". By the way, I have tested 
> extracting text from a complete Persian PDF file with multiple pages; the 
> result is disappointing. Actually, it is totally wrong. Please let me know if 
> I should upload an example Persian PDF file.
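
To make the mismatches in the report easier to inspect, the following is a 
minimal sketch (not part of PDFBox, and not from the original thread) that 
compares an expected word with the extracted one, code point by code point, in 
logical order. The class name CodePointDiff and its methods are hypothetical, 
and the \u escapes are an assumed transcription of the three word pairs quoted 
above using base Arabic-block letters. It needs Java 8 or later for 
String.codePoints().

```java
import java.util.ArrayList;
import java.util.List;

public class CodePointDiff {

    /** Formats one code point in the usual U+XXXX notation. */
    public static String hex(int codePoint) {
        return String.format("U+%04X", codePoint);
    }

    /**
     * Compares two strings code point by code point (logical order) and
     * returns one description per position where they differ.
     */
    public static List<String> diff(String expected, String extracted) {
        int[] a = expected.codePoints().toArray();
        int[] b = extracted.codePoints().toArray();
        List<String> out = new ArrayList<>();
        int n = Math.max(a.length, b.length);
        for (int i = 0; i < n; i++) {
            String left = i < a.length ? hex(a[i]) : "(none)";
            String right = i < b.length ? hex(b[i]) : "(none)";
            if (!left.equals(right)) {
                out.add("index " + i + ": expected " + left + ", got " + right);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Assumed transcription of the word pairs from the report:
        // {expected, extracted}.
        String[][] pairs = {
            {"\u0647\u0633\u062A\u0645", "\u0647\u0633\u062A\u0646"},
            {"\u0645\u0646", "\u0647\u0646"},
            {"\u0633\u0644\u0627\u0645", "\u0633\u0627\u0644\u0645"},
        };
        for (String[] pair : pairs) {
            System.out.println(pair[0] + " vs " + pair[1]);
            for (String line : diff(pair[0], pair[1])) {
                System.out.println("  " + line);
            }
        }
    }
}
```

Under that transcription, the first two pairs each differ in a single code 
point (U+0645 replaced by U+0646 and by U+0647 respectively), while the third 
pair has U+0644 and U+0627 swapped. A comparison like this only shows what the 
extracted text contains; it cannot tell whether the extractor or the mapping 
stored in the PDF is responsible.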
