Arabic compound characters not recognized by pdfbox

Ahmet Aker Thu, 25 Sep 2014 07:28:08 -0700

Hi,

I am using PDFBox (1.8.6) to convert Arabic PDF files (real text, not
images of text) to HTML. PDFBox works really well in most cases;
however, it has problems recognizing compound characters. I am
attaching a sample PDF file. In that file, for example, I get
&#1575;&#1604;&#1601;&#1594;&#1575;&#1606;&#1610; but I should be getting
&#1575;&#1604;&#1571;&#1601;&#1594;&#1575;&#1606;&#1610; (الأفغاني).
PDFBox misses the part highlighted in red, i.e. &#1571; (أ). The same
applies to:

&#1575; (PDFBox output) instead of &#1575;&#1604;&#1604;&#1607; (الله)
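
To pin down exactly which characters are dropped, the two outputs can be
compared code point by code point. Below is a minimal sketch in plain Java
(it makes no PDFBox calls). The class name EntityDiff and its two helper
methods are hypothetical, and it assumes the extracted text only loses
characters and never reorders or substitutes them.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of PDFBox.
final class EntityDiff {
    private static final Pattern NUMERIC_REF = Pattern.compile("&#(\\d+);");

    // Turn decimal references such as "&#1575;" into the characters they encode.
    static String decode(String html) {
        Matcher m = NUMERIC_REF.matcher(html);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(html, last, m.start());
            out.appendCodePoint(Integer.parseInt(m.group(1)));
            last = m.end();
        }
        out.append(html.substring(last));
        return out.toString();
    }

    // List the code points of `expected` that are absent from `actual`.
    // Assumption: `actual` is `expected` with some characters dropped.
    static List<String> missing(String expected, String actual) {
        int[] exp = expected.codePoints().toArray();
        int[] act = actual.codePoints().toArray();
        List<String> result = new ArrayList<>();
        int j = 0;
        for (int cp : exp) {
            if (j < act.length && act[j] == cp) {
                j++; // character survived extraction
            } else {
                result.add(String.format("U+%04X %s", cp, Character.getName(cp)));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String expected = decode("&#1575;&#1604;&#1571;&#1601;&#1594;&#1575;&#1606;&#1610;");
        String actual = decode("&#1575;&#1604;&#1601;&#1594;&#1575;&#1606;&#1610;");
        // Prints one line per code point that the extraction dropped.
        missing(expected, actual).forEach(System.out::println);
    }
}
```

For the first example this reports a single missing code point, U+0623
(alef with hamza above); for the second it reports the two lams and the
final heh.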

 

Could this have something to do with the encodings? I hope you can help
me with this matter.

 

Many thanks,

ahmet
