Hi,

I am using PDFBox (1.8.6) to convert Arabic PDF files (real text, not
images of text) to HTML. PDFBox works really well in most cases;
however, it has problems recognizing compound characters. I am
attaching a sample PDF file. In it, for example, I get
الفغاني  but I should be getting
الأفغاني (الأفغاني).
PDFBox drops the part highlighted in red, i.e. the أ. The same applies to:

ا (PDFBox output) --- الله (الله)

 

Could this have something to do with the encodings? I hope you can help
me with this matter.
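
To make the difference easier to see, here is a minimal sketch that lists the Unicode code points of the expected and the extracted word. It does not call PDFBox at all; the two strings are hard-coded from the example above, and the class and method names (CodePointDump, dump) are just hypothetical helpers I made up for illustration. Running the same dump on the real extractor output might show whether the character is missing entirely or arrives as a different code point.

```java
class CodePointDump {
    // Returns the code points of s as space-separated "U+XXXX" tokens.
    static String dump(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Expected word, with ALEF WITH HAMZA ABOVE (U+0623) in third position.
        String expected = "\u0627\u0644\u0623\u0641\u063A\u0627\u0646\u064A";
        // Word as extracted in my case: U+0623 is absent.
        String extracted = "\u0627\u0644\u0641\u063A\u0627\u0646\u064A";

        System.out.println("expected:  " + dump(expected));
        System.out.println("extracted: " + dump(extracted));
    }
}
```

In this hard-coded example the only difference between the two lines is the U+0623 token.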

 

Many thanks,

ahmet