Re: Arabic compound characters not recognized by pdfbox

John Hewson Thu, 25 Sep 2014 10:58:01 -0700

Hi Ahmet

We’re currently looking into a similar problem:
https://issues.apache.org/jira/browse/PDFBOX-2259

If you think this is the *exact* same problem that you’re seeing, please attach 
your PDF file to that JIRA issue. If not, please open a new JIRA issue and 
attach your file there. (You can attach files in JIRA using More > Attach Files.)

The mailing list does not support file attachments, so we can’t see your file 
unless it is on JIRA.

Thanks

-- John

On 25 Sep 2014, at 07:26, Ahmet Aker <[email protected]> wrote:

> Hi,
> I am using pdfBox (1.8.6) to convert Arabic PDF files (real text, not 
> images of text) to HTML. PdfBox works really well in most cases; 
> however, it has problems recognizing compound characters. I am 
> attaching a sample PDF file. In it, for example, I get 
> &#1575;&#1604;&#1601;&#1594;&#1575;&#1606;&#1610;  but I should be getting  
> &#1575;&#1604;&#1571;&#1601;&#1594;&#1575;&#1606;&#1610; (الأفغاني). 
> PdfBox misses the bit highlighted in red (&#1571;).   The same applies to:
>  
> &#1575; (pdfBox output) --- &#1575;&#1604;&#1604;&#1607; (الله)
>  
> Could this have something to do with the encodings? I hope you can help me 
> with this matter.
>  
> Many thanks,
> ahmet
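
For anyone comparing the two strings quoted above, the following is a minimal sketch (not part of PDFBox) that decodes the numeric character references and reports which characters the extracted text is missing. It uses only the Python standard library; the function name `dropped_characters` and the variable names are made up for illustration.

```python
import difflib
import html
import unicodedata


def dropped_characters(actual_refs: str, expected_refs: str) -> list[str]:
    """Return characters of the expected text that have no counterpart in the extracted text."""
    # Turn "&#1575;..." into real Unicode characters.
    actual = html.unescape(actual_refs)
    expected = html.unescape(expected_refs)
    matcher = difflib.SequenceMatcher(None, actual, expected, autojunk=False)
    dropped = []
    for tag, _a0, _a1, b0, b1 in matcher.get_opcodes():
        # "insert" and "replace" mark spans of the expected text not matched in the extracted text.
        if tag in ("insert", "replace"):
            dropped.extend(expected[b0:b1])
    return dropped


# The two strings from the quoted message: extractor output vs. expected text.
extracted = "&#1575;&#1604;&#1601;&#1594;&#1575;&#1606;&#1610;"
wanted = "&#1575;&#1604;&#1571;&#1601;&#1594;&#1575;&#1606;&#1610;"

for ch in dropped_characters(extracted, wanted):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
```

For the first example this should report U+0623 (ARABIC LETTER ALEF WITH HAMZA ABOVE) as the only missing character. Listing the dropped code points like this in a JIRA report may make it easier to tell whether two PDFs show the same problem.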
