Ahmet created PDFBOX-2382:
-----------------------------
Summary: Arabic compound words are displayed incorrectly
Key: PDFBOX-2382
URL: https://issues.apache.org/jira/browse/PDFBOX-2382
Project: PDFBox
Issue Type: Bug
Components: Text extraction
Affects Versions: 1.8.6
Environment: Windows 7, NetBeans 8.0, Java 8
Reporter: Ahmet
Hi, I am using PDFBox 1.8.6 to convert Arabic PDF files (real text, not
scanned images of text) to HTML. PDFBox works well in most cases, but it has
problems recognizing compound characters. I am attaching a sample PDF file.
In it, for example, I get
الفغاني where I should be getting
الأفغاني; PDFBox drops the أ.
The same happens with الله, for which PDFBox outputs only ا.
Could this be related to the encodings? I hope you can help me with this.
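To make the difference concrete, the expected and extracted strings can be compared code point by code point. The following is only a minimal sketch: the class CodePointDiff and its method missing are hypothetical names, not part of PDFBox, and the comparison assumes the extracted text is the expected text with some characters left out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of PDFBox: lists the code points that are
// present in the expected string but absent from the extracted one.
public class CodePointDiff {

    // Assumes "actual" is "expected" with some characters dropped.
    public static List<String> missing(String expected, String actual) {
        int[] exp = expected.codePoints().toArray();
        int[] act = actual.codePoints().toArray();
        List<String> out = new ArrayList<>();
        int j = 0; // position in the extracted text
        for (int cp : exp) {
            if (j < act.length && act[j] == cp) {
                j++; // character survived extraction
            } else {
                out.add(String.format("U+%04X", cp)); // character was dropped
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Expected word and PDFBox output from this report, written as
        // escapes so the source file encoding does not matter.
        String expected = "\u0627\u0644\u0623\u0641\u063A\u0627\u0646\u064A";
        String actual = "\u0627\u0644\u0641\u063A\u0627\u0646\u064A";
        System.out.println(missing(expected, actual)); // prints [U+0623]
    }
}
```

For the first example the only dropped code point is U+0623 (ARABIC LETTER ALEF WITH HAMZA ABOVE).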
I know a similar problem has been reported before, and the conclusion was that
it is caused by how the PDF file was generated. Is there a way to generate a
"correct" PDF file so that PDFBox extracts the text correctly? I created the
attached file using OpenOffice 4.0. The original document is in MS Word format
and was converted with OpenOffice.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)