[ 
https://issues.apache.org/jira/browse/PDFBOX-1424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13489295#comment-13489295
 ] 

Ali Majdzadeh Kohbanani commented on PDFBOX-1424:
-------------------------------------------------

Andreas, thanks a lot for looking into this. I checked the file: the ligature 
issue is fixed, but the output file still contains wrong glyphs. After some 
searching, I found that PDFBox extracts text very well from PDF files created 
with the dompdf library (code.google.com/p/dompdf/). Those files show neither 
the ligature problem nor the wrong glyphs. Please let me know if I should 
provide more detailed information on this.
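
To make the remaining mismatches easier to pin down, here is a minimal sketch 
that lists the code points that differ between an expected word and the 
extracted one. It uses only the JDK and assumes Java 8 or later; the class 
name GlyphDiff and the method diff are hypothetical helpers, not part of 
PDFBox. The two word pairs in main are taken from the original report and are 
written as Unicode escapes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of PDFBox: lists the code point mismatches
// between an expected word and the word returned by text extraction.
public class GlyphDiff {

    // Returns one entry per differing position, formatted as
    // "<index>: <expected code point> -> <extracted code point>".
    public static List<String> diff(String expected, String extracted) {
        int[] a = expected.codePoints().toArray();
        int[] b = extracted.codePoints().toArray();
        List<String> out = new ArrayList<>();
        int n = Math.max(a.length, b.length);
        for (int i = 0; i < n; i++) {
            // A missing position is reported as "(none)" when lengths differ.
            String x = i < a.length ? String.format("U+%04X", a[i]) : "(none)";
            String y = i < b.length ? String.format("U+%04X", b[i]) : "(none)";
            if (!x.equals(y)) {
                out.add(i + ": " + x + " -> " + y);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // First pair from the report: the final MEEM comes out as NOON.
        System.out.println(diff("\u0647\u0633\u062A\u0645", "\u0647\u0633\u062A\u0646"));
        // Third pair from the report: LAM and ALEF come out in swapped order.
        System.out.println(diff("\u0633\u0644\u0627\u0645", "\u0633\u0627\u0644\u0645"));
    }
}
```

For the first pair this reports a single substituted code point (U+0645 
replaced by U+0646). For the second pair it reports the same two code points 
in swapped positions, which suggests a reordering of the lam-alef sequence 
rather than a wrong glyph mapping.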
                
> Wrong glyph (Persian) is used in extracted text instead of the original glyph 
> (Persian) in PDF file
> ---------------------------------------------------------------------------------------------------
>
>                 Key: PDFBOX-1424
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1424
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.7.1
>         Environment: Windows XP, Java 1.6.0
>            Reporter: Ali Majdzadeh Kohbanani
>            Assignee: Andreas Lehmkühler
>             Fix For: 1.8.0
>
>         Attachments: PDFBOX1424-persian_test.html, persian_test.html, 
> persian_test.pdf
>
>
> Hi,
> I am very new to PDFBox and I am working with Persian PDF files. When I 
> convert Persian PDF files using PDFBox-app, some Persian glyphs such as م 
> come out wrong in the extracted text. For example, the Persian word "هستم" 
> is extracted as "هستن", and "من" is extracted as "هن". 
> Also, the word "سلام" is extracted as "سالم". I have also tested 
> extracting text from a complete, multi-page Persian PDF file; the result is 
> disappointing, in fact entirely wrong. Please let me know if I should upload 
> an example Persian PDF file.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
