[
https://issues.apache.org/jira/browse/PDFBOX-1424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13477249#comment-13477249
]
Andreas Lehmkühler edited comment on PDFBOX-1424 at 10/17/12 8:13 AM:
----------------------------------------------------------------------
Sorry, no offense, but as a European I have to admit that everything looks
the same to me ;-)
Could you please help me understand the problem in detail, so that I can
find out where to look?
For example:
the word "سلام" is extracted as "سالم"
What is wrong here? Is the correct word a ligature? Are the characters simply
the wrong ones? If so, why are they wrong (wrong character, wrong language)?
What about the other issues, are they similar?
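One way to answer the "wrong characters or something else?" question without
reading the script is to dump the Unicode code points of both strings and
compare them. The following is only a minimal sketch in plain Java (no PDFBox
calls); the class and method names are made up for illustration, the string
literals are my transcription of the words quoted in this issue, and
Character.getName needs Java 7 or later.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of PDFBox.
public class CodePointDump {

    /** Splits a string into its Unicode code points, in logical (storage) order. */
    public static int[] codePoints(String text) {
        int[] cps = new int[text.codePointCount(0, text.length())];
        for (int i = 0, n = 0; i < text.length(); n++) {
            cps[n] = text.codePointAt(i);
            i += Character.charCount(cps[n]);
        }
        return cps;
    }

    /** Formats each code point as "U+XXXX NAME". */
    public static List<String> describe(String text) {
        List<String> out = new ArrayList<String>();
        for (int cp : codePoints(text)) {
            out.add(String.format("U+%04X %s", cp, Character.getName(cp)));
        }
        return out;
    }

    /** True if both strings hold the same code points, but in a different order. */
    public static boolean isReordering(String a, String b) {
        int[] x = codePoints(a);
        int[] y = codePoints(b);
        if (Arrays.equals(x, y)) {
            return false; // identical strings are not a reordering
        }
        Arrays.sort(x);
        Arrays.sort(y);
        return Arrays.equals(x, y);
    }

    public static void main(String[] args) {
        // Assumed transcription of the pair from the example above,
        // written as escapes so the source file encoding does not matter.
        String expected = "\u0633\u0644\u0627\u0645";
        String extracted = "\u0633\u0627\u0644\u0645";
        System.out.println(describe(expected));
        System.out.println(describe(extracted));
        System.out.println("reordering: " + isReordering(expected, extracted));
    }
}
```

If isReordering returns true for a pair, the letters themselves are right and
only their order differs; if it returns false, at least one letter was
replaced by a different one. Which of the two applies to each reported word
would narrow down where to look, though it is not a diagnosis by itself.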
> Wrong glyph (Persian) is used in extracted text instead of the original glyph
> (Persian) in PDF file
> ---------------------------------------------------------------------------------------------------
>
> Key: PDFBOX-1424
> URL: https://issues.apache.org/jira/browse/PDFBOX-1424
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 1.7.1
> Environment: Windows XP, Java 1.6.0
> Reporter: Ali Majdzadeh Kohbanani
> Attachments: persian_test.html, persian_test.pdf
>
>
> Hi
> I am very new to PDFBox and I am dealing with Persian PDF files. When I
> convert Persian PDF files using PDFBox-app, some Persian glyphs like م are
> displayed wrongly in the extracted text. For example, the word "هستم" in
> Persian is extracted as "هستن" and "من" in Persian is extracted as "هن".
> Also, the word "سلام" is extracted as "سالم". By the way, I have tested
> extracting text from a complete Persian PDF file with multiple pages; the
> result is disappointing. Actually, it is totally wrong. Please let me know if
> I should upload an example Persian PDF file.