[jira] [Commented] (PDFBOX-1424) Wrong glyph (Persian) is used in extacted text instead of the original glyph (Persian) in PDF file

Ali Majdzadeh Kohbanani (JIRA) Fri, 02 Nov 2012 15:36:14 -0700

    [ 
https://issues.apache.org/jira/browse/PDFBOX-1424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13489802#comment-13489802
 ]


Ali Majdzadeh Kohbanani commented on PDFBOX-1424:
-------------------------------------------------

I performed the test, and yes, the results were the same. There are errors in 
both the PDFBox extracted text and the adobe-test extracted text. I don't know 
whether it is possible to resolve them in PDFBox, but I will list them here:
1) Only the first line is extracted correctly.
2) The second line has two errors. The word "من" which means "I" in English is 
extracted as "هي" which has no meaning in Persian. Also, the word "هستم" which 
means "am" in English is extracted as "هستن" which means "being" in English.
3) The third line has one error. The word "شما" which means "you" in English is 
extracted as "شوا" which has no meaning in Persian.
4) The fourth line has two errors. First of all, the word "ایران" which means 
"Iran" in English is extracted as "ایرای" which has no meaning in Persian. 
Also, the word "سرزمینی" which means "land" in English is extracted as "سرزهي 
یٌ" which has no meaning in Persian; in fact, it is not a Persian word at all.
5) The last line has multiple errors. The sentence "ما در ایران زندگی می‌کنیم" 
which means "We are living in Iran" in English is extracted as "ها در ايراى ز 
دًگی هی ك يٌن" in which none of the words are valid Persian words, except "در" 
(in). The errors are listed below.
Original                             Extracted
ما                                       ها
ایران                                    ایرای
زندگی                                    زدًگی
می‌کنیم                                   هی ك يٌن
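
To make mismatches like the ones above easier to report, the expected and 
extracted words can be compared one code point at a time. The following is 
only a minimal sketch: GlyphDiff and diff are hypothetical names, not part of 
PDFBox, and the code assumes Java 8 or later (for String.codePoints()). It 
only reports where two strings differ; it does not say anything about why the 
wrong character was extracted.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of PDFBox: compares an expected word with the
// extracted word one code point at a time and reports each mismatch.
class GlyphDiff {

    static List<String> diff(String expected, String extracted) {
        int[] exp = expected.codePoints().toArray();
        int[] act = extracted.codePoints().toArray();
        List<String> mismatches = new ArrayList<>();

        // Compare position by position over the shared prefix length.
        int common = Math.min(exp.length, act.length);
        for (int i = 0; i < common; i++) {
            if (exp[i] != act[i]) {
                mismatches.add(String.format(
                        "index %d: expected U+%04X, got U+%04X", i, exp[i], act[i]));
            }
        }

        // Extra or missing characters (for example a stray space) are reported
        // as a length difference rather than aligned.
        if (exp.length != act.length) {
            mismatches.add(String.format(
                    "length differs: expected %d code points, got %d",
                    exp.length, act.length));
        }
        return mismatches;
    }

    public static void main(String[] args) {
        // The word meaning "am" from the report: HEH, SEEN, TEH, MEEM (U+0645).
        String expected = "\u0647\u0633\u062A\u0645";
        // The extracted form, which ends in NOON (U+0646) instead of MEEM.
        String extracted = "\u0647\u0633\u062A\u0646";
        for (String line : diff(expected, extracted)) {
            System.out.println(line);
        }
    }
}
```

For the pair "هستم" / "هستن" this prints "index 3: expected U+0645, got U+0646". 
Running the same comparison over the other word pairs would show whether the 
same original letter is always replaced by the same wrong letter.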
                
> Wrong glyph (Persian)  is used in extacted text instead of the original glyph 
> (Persian) in PDF file
> ---------------------------------------------------------------------------------------------------
>
>                 Key: PDFBOX-1424
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1424
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.7.1
>         Environment: Windows XP, Java 1.6.0
>            Reporter: Ali Majdzadeh Kohbanani
>            Assignee: Andreas Lehmkühler
>             Fix For: 1.8.0
>
>         Attachments: PDFBOX1424-persian_test.html, persian_test.html, 
> persian_test.pdf
>
>
> Hi
> I am very new to PDFBox and I am dealing with Persian PDF files. When I 
> convert Persian PDF files using PDFBox-app, some Persian glyphs like م are 
> displayed wrongly in the extracted text. For example, the word "هستم" in 
> Persian is extracted as "هستن" and "من" in Persian is extracted as "هن". 
> Also, the word "سلام" is extracted as "سالم". By the way, I have tested 
> extracting text from a complete Persian PDF file with multiple pages; the 
> result is disappointing. Actually, it is totally wrong. Please let me know if 
> I should upload an example Persian PDF file.

