[ 
https://issues.apache.org/jira/browse/PDFBOX-1424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ali Majdzadeh Kohbanani updated PDFBOX-1424:
--------------------------------------------

    Attachment: persian_test.html
                persian_test.pdf

Hi Andreas,
Thanks for your attention. I have uploaded a sample PDF file and the HTML 
output generated by pdfbox-app.
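
In case it helps to narrow down which glyphs are affected, below is a
minimal sketch in plain Java that compares an expected word with the
extracted word and lists the positions where the Unicode code points
differ. It does not call PDFBox at all; the class name GlyphDiff and the
method diff are hypothetical names chosen for illustration, and the
sample words are the ones quoted in the issue description. It assumes
all characters are in the Basic Multilingual Plane, which holds for the
Persian letters in question.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of PDFBox: reports code point mismatches
// between the text that should have been extracted and the actual output.
public class GlyphDiff {

    // Returns one entry per index where the two strings differ.
    // Assumes BMP-only input, so one char equals one code point.
    static List<String> diff(String expected, String extracted) {
        List<String> out = new ArrayList<String>();
        int n = Math.max(expected.length(), extracted.length());
        for (int i = 0; i < n; i++) {
            String x = i < expected.length()
                    ? String.format("U+%04X", (int) expected.charAt(i))
                    : "(none)";
            String y = i < extracted.length()
                    ? String.format("U+%04X", (int) extracted.charAt(i))
                    : "(none)";
            if (!x.equals(y)) {
                out.add(i + ": expected " + x + ", got " + y);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Word pairs from the report, written as escapes so the source
        // file compiles regardless of its encoding.
        String[][] pairs = {
            {"\u0647\u0633\u062A\u0645", "\u0647\u0633\u062A\u0646"},
            {"\u0645\u0646", "\u0647\u0646"},
            {"\u0633\u0644\u0627\u0645", "\u0633\u0627\u0644\u0645"},
        };
        for (String[] p : pairs) {
            for (String line : diff(p[0], p[1])) {
                System.out.println(line);
            }
            System.out.println("--");
        }
    }
}
```

For the first pair this reports a single mismatch at index 3 (expected
U+0645, got U+0646). For the third pair it reports indexes 1 and 2 with
U+0644 and U+0627 swapped, which suggests a character ordering problem
rather than a wrong glyph mapping.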
                
> Wrong glyph (Persian) is used in extracted text instead of the original glyph 
> (Persian) in PDF file
> ---------------------------------------------------------------------------------------------------
>
>                 Key: PDFBOX-1424
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1424
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.7.1
>         Environment: Windows XP, Java 1.6.0
>            Reporter: Ali Majdzadeh Kohbanani
>         Attachments: persian_test.html, persian_test.pdf
>
>
> Hi
> I am very new to PDFBox and I am working with Persian PDF files. When I 
> convert Persian PDF files using pdfbox-app, some Persian glyphs such as م 
> come out wrong in the extracted text. For example, the Persian word "هستم" 
> is extracted as "هستن", and "من" is extracted as "هن". Also, the word 
> "سلام" is extracted as "سالم". I have also tested extracting text from a 
> complete multi-page Persian PDF file, and the extracted text is wrong 
> throughout. Please let me know if I should upload an example Persian PDF 
> file.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira
