[jira] [Created] (PDFBOX-2382) Arabic compound words are displayed incorrectly

Ahmet (JIRA) Fri, 26 Sep 2014 00:43:44 -0700

Ahmet created PDFBOX-2382:
-----------------------------

             Summary: Arabic compound words are displayed incorrectly
                 Key: PDFBOX-2382
                 URL: https://issues.apache.org/jira/browse/PDFBOX-2382
             Project: PDFBox
          Issue Type: Bug
          Components: Text extraction
    Affects Versions: 1.8.6
         Environment: Windows 7, NetBeans 8.0, Java 8
            Reporter: Ahmet



Hi, I am using PDFBox (1.8.6) to convert Arabic PDF files (real text, not 
images of text) to HTML. PDFBox works really well in most cases; however, 
it has problems recognizing compound characters. I am attaching a sample PDF 
file. In it, for example, I get 
&#1575;&#1604;&#1601;&#1594;&#1575;&#1606;&#1610; (الفغاني) but I should be getting 
&#1575;&#1604;&#1571;&#1601;&#1594;&#1575;&#1606;&#1610; (الأفغاني). PDFBox 
misses the part highlighted in red. The same applies to: &#1575; (PDFBox 
output) instead of &#1575;&#1604;&#1604;&#1607; (الله). Could this be related 
to the encodings? I hope you can help me with this.
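
To see exactly which code point goes missing, the two strings can be decoded 
and compared code point by code point. The following is only a minimal sketch 
using the plain JDK (Java 8), not PDFBox itself; the class name EntityDiff and 
its helper methods are made up for this illustration, and the input strings 
are the decimal character references quoted above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for this report; not part of PDFBox.
final class EntityDiff {
    // Matches decimal numeric character references such as &#1575;
    private static final Pattern NUMERIC_ENTITY = Pattern.compile("&#(\\d+);");

    // Replaces every decimal character reference with the character it names.
    static String decodeEntities(String html) {
        Matcher m = NUMERIC_ENTITY.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int codePoint = Integer.parseInt(m.group(1));
            String ch = new String(Character.toChars(codePoint));
            m.appendReplacement(out, Matcher.quoteReplacement(ch));
        }
        m.appendTail(out);
        return out.toString();
    }

    // Renders a string as a space-separated list like "U+0627 U+0644".
    static String describe(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    // Returns the first UTF-16 index at which a and b differ, or -1 if equal.
    static int firstDifference(String a, String b) {
        int n = Math.min(a.length(), b.length());
        for (int i = 0; i < n; i++) {
            if (a.charAt(i) != b.charAt(i)) {
                return i;
            }
        }
        return a.length() == b.length() ? -1 : n;
    }

    public static void main(String[] args) {
        String actual = decodeEntities(
                "&#1575;&#1604;&#1601;&#1594;&#1575;&#1606;&#1610;");
        String expected = decodeEntities(
                "&#1575;&#1604;&#1571;&#1601;&#1594;&#1575;&#1606;&#1610;");
        System.out.println("actual:   " + describe(actual));
        System.out.println("expected: " + describe(expected));
        System.out.println("first difference at index "
                + firstDifference(actual, expected));
    }
}
```

For the two strings in this report the first difference is at index 2, where 
the expected text has U+0623 (ARABIC LETTER ALEF WITH HAMZA ABOVE) and the 
extracted text already continues with U+0641 (ARABIC LETTER FEH).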

I know that something similar has been reported before, and the conclusion was 
that the issue is caused by how the PDF file is generated. Is there a way to 
generate a "correct" PDF file so that PDFBox extracts the text correctly? I 
created the attached file using OpenOffice 4.0. The original document is in MS 
Word format and was converted with OpenOffice.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
