[jira] [Comment Edited] (PDFBOX-1424) Wrong glyph (Persian) is used in extracted text instead of the original glyph (Persian) in PDF file

JIRA Wed, 17 Oct 2012 01:14:13 -0700

    [ 
https://issues.apache.org/jira/browse/PDFBOX-1424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13477249#comment-13477249
 ]


Andreas Lehmkühler edited comment on PDFBOX-1424 at 10/17/12 8:13 AM:
----------------------------------------------------------------------

Sorry, no offense, but as a European I have to admit that it all looks the 
same to me ;-) 

Could you please help me understand the problem in detail, so that I can 
find out where to look?

For example:

the word "سلام" is extracted as "سالم" 

What is wrong here? Is the correct word a ligature? Are the characters simply 
the wrong ones? If so, why are they wrong (wrong character, wrong language)? 
What about the other issues, are they similar?
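
One way to answer the questions above is to dump the code points of both 
words. The sketch below is plain Python, not PDFBox code. The describe() 
helper is hypothetical, and the escape sequences are an assumption about how 
the two quoted words are encoded (base Arabic letters, no presentation forms).

```python
import unicodedata


def describe(word):
    """Return (code point, Unicode name) pairs in logical (storage) order."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")) for ch in word]


# Assumed encodings of the two words quoted in the report.
expected = "\u0633\u0644\u0627\u0645"   # seen, lam, alef, meem
extracted = "\u0633\u0627\u0644\u0645"  # seen, alef, lam, meem

for label, word in (("expected", expected), ("extracted", extracted)):
    print(label)
    for code, name in describe(word):
        print("  ", code, name)

# Positions at which the two words differ.
differing = [i for i, (a, b) in enumerate(zip(expected, extracted)) if a != b]

# True when both words contain exactly the same letters, only reordered.
same_letters = sorted(expected) == sorted(extracted)
```

If both dumps list the same letters and only their order differs, the 
extraction is reordering characters rather than substituting them, which 
might point at how a ligature is expanded. The same check can be run on the 
other word pairs from the report.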
                
> Wrong glyph (Persian) is used in extracted text instead of the original glyph 
> (Persian) in PDF file
> ---------------------------------------------------------------------------------------------------
>
>                 Key: PDFBOX-1424
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1424
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.7.1
>         Environment: Windows XP, Java 1.6.0
>            Reporter: Ali Majdzadeh Kohbanani
>         Attachments: persian_test.html, persian_test.pdf
>
>
> Hi,
> I am very new to PDFBox and am working with Persian PDF files. When I 
> convert Persian PDF files using PDFBox-app, some Persian glyphs, such as م, 
> come out wrong in the extracted text. For example, the word "هستم" in 
> Persian is extracted as "هستن" and "من" in Persian is extracted as "هن". 
> Also, the word "سلام" is extracted as "سالم". I have also tested 
> extracting text from a complete, multi-page Persian PDF file; the 
> result is disappointing. In fact, it is totally wrong. Please let me know if 
> I should upload an example Persian PDF file.

