[ 
https://issues.apache.org/jira/browse/PDFBOX-1424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13477307#comment-13477307
 ] 

Ali Majdzadeh Kohbanani commented on PDFBOX-1424:
-------------------------------------------------

Andreas,
Thanks for your comment. You are right, these problems are somewhat difficult to 
explain since we don't speak the same language ;)
I will try to explain the issues in more detail. The word "هستم" in Persian 
means "I am". For example, "I am Ali" in Persian is "من علی هستم", but the 
text PDFBox extracts for this word is "هستن", which is not used in everyday 
Persian, although it does mean "being". The issue is that the character "م" is 
extracted as the character "ن". For the word "سلام", the problem is the 
ligature "لا", which consists of the characters "ل" and "ا". This ligature is 
extracted as "ال", so the word "سلام", which means "Hello" in Persian, is 
extracted as "سالم", which means "Healthy"!
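
For readers who cannot read Persian, the two errors may be easier to see as Unicode code points. The following is only a minimal sketch, not part of PDFBox: the class name CodePointDiff and its helper methods are made up for illustration, and it assumes Java 8 or later (for String.codePoints()). It lists the code points of the expected and the extracted word and reports the first position where they differ. In the first pair the last letter is U+0645 (meem) in the original and U+0646 (noon) in the extracted text; in the second pair U+0644 (lam) and U+0627 (alef) appear in swapped order.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; it does not use or depend on PDFBox.
class CodePointDiff {

    /** Formats each code point of s as U+XXXX, in logical (storage) order. */
    static List<String> codePoints(String s) {
        List<String> out = new ArrayList<>();
        s.codePoints().forEach(cp -> out.add(String.format("U+%04X", cp)));
        return out;
    }

    /** Returns the code point index of the first mismatch, or -1 if both strings are equal. */
    static int firstDifference(String expected, String actual) {
        int[] e = expected.codePoints().toArray();
        int[] a = actual.codePoints().toArray();
        int n = Math.min(e.length, a.length);
        for (int i = 0; i < n; i++) {
            if (e[i] != a[i]) {
                return i;
            }
        }
        // Same prefix: equal if the lengths match, otherwise they differ where the shorter one ends.
        return e.length == a.length ? -1 : n;
    }

    public static void main(String[] args) {
        // Unicode escapes are used so the source file encoding does not matter.
        String[][] pairs = {
            // original "I am" (heh, seen, teh, meem) vs. extracted (heh, seen, teh, noon)
            {"\u0647\u0633\u062A\u0645", "\u0647\u0633\u062A\u0646"},
            // original "Hello" (seen, lam, alef, meem) vs. extracted (seen, alef, lam, meem)
            {"\u0633\u0644\u0627\u0645", "\u0633\u0627\u0644\u0645"},
        };
        for (String[] pair : pairs) {
            System.out.println("expected:  " + codePoints(pair[0]));
            System.out.println("extracted: " + codePoints(pair[1]));
            System.out.println("first difference at index " + firstDifference(pair[0], pair[1]));
        }
    }
}
```

The same comparison could be run on any expected/extracted pair copied from the attached persian_test files.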
Andreas, I have also noticed that these problems do not occur in certain PDF 
files. For example, the file Complex.pdf uploaded as an example in 
https://issues.apache.org/jira/browse/TIKA-713 has none of these problems. Maybe 
these issues are caused by the way the PDF is generated. Are there any 
special requirements to meet in order to create PDFBox-compatible PDF files? 
Should I use specific tools to create PDF files?
As another example, when I use PDFCreator to create PDF files and try 
to extract text from them using PDFBox, the result is just junk characters! 
However, when I use the "Save as PDF" plugin for Microsoft Word provided by 
Microsoft, the extracted text is valid but contains the errors I described 
above. I don't know how Complex.pdf was created, but it contains minimal errors; 
at least, it doesn't contain the errors I described. In Complex.pdf, I 
only noticed some words being displaced in the extracted text.
I searched a lot and tested various tools for creating PDF files from which 
PDFBox could extract text correctly, but I had no success. Sorry, but 
another question that seems related to this issue concerns fonts. Does the 
font that is used to type the document and is embedded in the PDF file 
created from it affect the quality of the text extraction performed by 
PDFBox?
I think this comment has become too long; sorry for that, and many thanks for 
your consideration and attention.
                
> Wrong glyph (Persian) is used in extracted text instead of the original glyph 
> (Persian) in PDF file
> ---------------------------------------------------------------------------------------------------
>
>                 Key: PDFBOX-1424
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1424
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.7.1
>         Environment: Windows XP, Java 1.6.0
>            Reporter: Ali Majdzadeh Kohbanani
>         Attachments: persian_test.html, persian_test.pdf
>
>
> Hi
> I am very new to PDFBox and I am dealing with Persian PDF files. When I 
> convert Persian PDF files using PDFBox-app, some Persian glyphs like م are 
> displayed wrongly in the extracted text. For example, the word "هستم" in 
> Persian is extracted as "هستن" and "من" in Persian is extracted as "هن". 
> Also, the word "سلام" is extracted as "سالم". By the way, I have tested 
> extracting text from a complete Persian PDF file with multiple pages; the 
> result is disappointing. Actually, it is totally wrong. Please let me know if 
> I should upload an example Persian PDF file.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
