Hi John,

Thanks for getting back to me. I will attach my PDF document to the JIRA issue.

Thanks,
ahmet

-----Original Message-----
From: John Hewson [mailto:[email protected]]
Sent: 25 September 2014 18:57
To: [email protected]
Subject: Re: Arabic compound characters not recognized by pdfbox

Hi Ahmet,

We’re currently looking into a similar problem:

https://issues.apache.org/jira/browse/PDFBOX-2259

If you think this is the *exact* same problem that you’re seeing, please attach your PDF file to that JIRA issue; if not, please open a new JIRA issue and attach your file there. (You can attach files in JIRA using More > Attach Files.) The mailing list does not support file attachments, so we can’t see your file unless it is on JIRA.

Thanks

-- John

On 25 Sep 2014, at 07:26, Ahmet Aker <[email protected]> wrote:

> Hi,
>
> I am using PDFBox (1.8.6) to convert Arabic PDF files (real text, not images of text) to HTML. PDFBox works really well in most cases; however, it has problems recognizing compound characters. I am attaching a sample PDF file. In it, for example, I get الفغاني but I should be getting الأفغاني (الأفغاني). PDFBox misses the part highlighted in red. The same holds for:
>
> ا (PDFBox output) --- الله (الله)
>
> Could this have something to do with the encodings? I hope you can help me with this matter.
>
> Many thanks,
> ahmet
