RE: Arabic compound characters not recognized by pdfbox

Ahmet Aker Fri, 26 Sep 2014 00:33:08 -0700

Hi John,
Thanks for getting back to me. I will attach my PDF document to the JIRA issue.


Thanks,
ahmet

-----Original Message-----
From: John Hewson [mailto:[email protected]] 
Sent: 25 September 2014 18:57
To: [email protected]
Subject: Re: Arabic compound characters not recognized by pdfbox

Hi Ahmet

We’re currently looking into a similar problem
https://issues.apache.org/jira/browse/PDFBOX-2259

If you think this is the *exact* same problem that you’re seeing, please
attach your PDF file to that JIRA issue. If not, please open a new JIRA
issue and attach your file. (You can attach files in JIRA using More >
Attach Files.)

The mailing list does not support file attachments, so we can’t see your
file unless it is on JIRA.

Thanks

-- John

On 25 Sep 2014, at 07:26, Ahmet Aker <[email protected]> wrote:

> Hi,
> I am using PDFBox (1.8.6) to convert Arabic PDF files (real text, not
> images of text) to HTML. PDFBox works really well in most cases; however,
> it has problems recognizing compound characters. I am attaching a sample
> PDF file. In it, for example, I get الفغاني but I should be getting
> الأفغاني. PDFBox misses the part highlighted in red. The same applies to:
>
> ا (PDFBox output) --- الله (expected)
>
> Could this have something to do with the encodings? I hope you can help me
> with this matter.
>  
> Many thanks,
> ahmet
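
For anyone comparing extracted text against the expected text, the gap is
easiest to see at the code point level. The sketch below is an illustration
only and does not use PDFBox: the class name CodePointDiff and its methods
are hypothetical, and it assumes the extractor only drops characters (a
simple in-order scan, not a general diff). It decodes the two strings from
the report, written here as decimal character references, and lists the
expected code points that are absent from the output.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of PDFBox.
class CodePointDiff {
    private static final Pattern DECIMAL_REF = Pattern.compile("&#(\\d+);");

    // Replace decimal character references such as "&#1575;" with the
    // characters they stand for.
    static String decodeRefs(String s) {
        Matcher m = DECIMAL_REF.matcher(s);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int cp = Integer.parseInt(m.group(1));
            String ch = new String(Character.toChars(cp));
            m.appendReplacement(out, Matcher.quoteReplacement(ch));
        }
        m.appendTail(out);
        return out.toString();
    }

    // Walk both strings in order and collect the expected code points that
    // have no counterpart in the actual text. Assumes characters were only
    // dropped, never substituted or reordered.
    static List<Integer> missing(String expected, String actual) {
        int[] exp = expected.codePoints().toArray();
        int[] act = actual.codePoints().toArray();
        List<Integer> result = new ArrayList<>();
        int j = 0;
        for (int cp : exp) {
            if (j < act.length && act[j] == cp) {
                j++;            // matched, advance in the actual text
            } else {
                result.add(cp); // expected code point was not extracted
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // The two strings quoted in the report.
        String actual = decodeRefs(
            "&#1575;&#1604;&#1601;&#1594;&#1575;&#1606;&#1610;");
        String expected = decodeRefs(
            "&#1575;&#1604;&#1571;&#1601;&#1594;&#1575;&#1606;&#1610;");
        for (int cp : missing(expected, actual)) {
            System.out.printf("U+%04X %s%n", cp, Character.getName(cp));
        }
    }
}
```

For the first example in the report this flags U+0623 (ARABIC LETTER ALEF
WITH HAMZA ABOVE); for the second it flags U+0644, U+0644 and U+0647.
Including the code points in the JIRA issue may make the report easier to
match against the font's encoding.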
