Ahmet created PDFBOX-2382:
-----------------------------

             Summary: Arabic compound words are displayed incorrectly
                 Key: PDFBOX-2382
                 URL: https://issues.apache.org/jira/browse/PDFBOX-2382
             Project: PDFBox
          Issue Type: Bug
          Components: Text extraction
    Affects Versions: 1.8.6
         Environment: Windows 7, NetBeans 8.0, Java 8
            Reporter: Ahmet


Hi, I am using PDFBox 1.8.6 to convert Arabic PDF files (real text, not 
scanned images of text) to HTML. PDFBox works really well in most cases; 
however, it has problems recognizing compound characters. I am attaching a 
sample PDF file. In it, for example, I get 
الفغاني  but I should be getting  
الأفغاني (الأفغاني). PDFBox 
misses the part highlighted in red.   The same applies to:  ا (PDFBox 
output) --- الله (الله)  Could this be related to the 
encodings? I hope you can help me with this.
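
To narrow down which characters are dropped, the extracted string can be 
compared with the expected one code point by code point. The following is only 
a sketch in plain Java with no PDFBox calls; the class name CodePointDiff and 
the method missing are hypothetical, and the greedy comparison assumes the 
extracted text only drops characters and never reorders or replaces them.

```java
import java.util.ArrayList;
import java.util.List;

public class CodePointDiff {

    // Lists the code points of `expected` that are absent from `actual`,
    // assuming `actual` is `expected` with some characters dropped.
    public static List<String> missing(String expected, String actual) {
        int[] exp = expected.codePoints().toArray();
        int[] act = actual.codePoints().toArray();
        List<String> out = new ArrayList<>();
        int j = 0; // position in the extracted text
        for (int cp : exp) {
            if (j < act.length && act[j] == cp) {
                j++; // character survived extraction
            } else {
                out.add(String.format("U+%04X", cp)); // character was dropped
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Expected word and PDFBox output from the first example,
        // written as escapes so the source file encoding does not matter.
        String expected = "\u0627\u0644\u0623\u0641\u063A\u0627\u0646\u064A";
        String actual = "\u0627\u0644\u0641\u063A\u0627\u0646\u064A";
        System.out.println(missing(expected, actual)); // prints [U+0623]
    }
}
```

For the first example this reports U+0623 (ARABIC LETTER ALEF WITH HAMZA 
ABOVE), the letter that follows the lam in the expected word.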

I know a similar problem has been reported before, and the conclusion was that 
it is caused by how the PDF file is generated. Is there a way to generate a 
"correct" PDF file so that PDFBox extracts the text correctly? I created the 
attached file using OpenOffice 4.0. The original document is in MS Word format 
and was converted with OpenOffice. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
