[
https://issues.apache.org/jira/browse/PDFBOX-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ahmet updated PDFBOX-2382:
--------------------------
Description:
Hi, I am using PDFBox (1.8.6) to convert Arabic PDF files (real text, not
images of text) to HTML. PDFBox works really well in most cases; however,
it has problems recognizing compound characters. I am attaching a
sample PDF file. In that file, for example, I get
الفغاني but I should be getting
الأفغاني (الأفغاني). PDFBox
misses the bit highlighted in red. The same applies to: ا (PDFBox
output) --- الله (الله). Could this have something to do with the
encodings? I hope you can help me with this matter.
I know something like this has been reported before, and the conclusion was that
the issue is due to how the PDF file is generated. Is there a way to generate a
"correct" PDF file so that PDFBox performs correct text extraction? I created the
attached file using OpenOffice 4.0. The original document is in MS Word format and
was converted to PDF with OpenOffice.
was:
Hi, I am using PDFBox (1.8.6) to convert Arabic PDF files (real text, not
images of text) to HTML. PDFBox works really well in most cases; however,
it has problems recognizing compound characters. I am attaching a
sample PDF file. In that file, for example, I get
الفغاني but I should be getting
الأفغاني (الأفغاني). PDFBox
misses the bit highlighted in red. The same applies to: ا (PDFBox
output) --- الله (الله). Could this have something to do with the
encodings? I hope you can help me with this matter.
I know something like this has been reported before, and the conclusion was that
the issue is due to how the PDF file is generated. Is there a way to generate a
"correct" PDF file so that PDFBox performs correct text extraction? I created the
attached file using OpenOffice 4.0. The original document is in MS Word format and
was converted with OpenOffice.
> Arabic compound words are displayed incorrectly
> -----------------------------------------------
>
> Key: PDFBOX-2382
> URL: https://issues.apache.org/jira/browse/PDFBOX-2382
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 1.8.6
> Environment: Windows 7, NetBeans 8.0, Java 8
> Reporter: Ahmet
> Attachments: arabicDoc2.doc, arabicDoc2.pdf
>
>
> Hi, I am using PDFBox (1.8.6) to convert Arabic PDF files (real text, not
> images of text) to HTML. PDFBox works really well in most cases;
> however, it has problems recognizing compound characters. I am
> attaching a sample PDF file. In that file, for example, I get
> الفغاني but I should be getting
> الأفغاني (الأفغاني).
> PDFBox misses the bit highlighted in red. The same applies to: ا
> (PDFBox output) --- الله (الله). Could this have something to do
> with the encodings? I hope you can help me with this matter.
> I know something like this has been reported before, and the conclusion was that
> the issue is due to how the PDF file is generated. Is there a way to generate a
> "correct" PDF file so that PDFBox performs correct text extraction? I created the
> attached file using OpenOffice 4.0. The original document is in MS Word format and
> was converted to PDF with OpenOffice.
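
To make the difference between the extracted and the expected text easier to see, here is a minimal Java sketch that lists the Unicode code points of two strings and reports which expected code points are absent from the extracted output. It does not call PDFBox and does not say why the characters are lost. The class and method names (CodePointDiff, dump, missing) are made up for illustration, and the two words from the report are transcribed as Unicode escapes on the assumption that they use the ordinary Arabic letters.

```java
import java.util.ArrayList;
import java.util.List;

public class CodePointDiff {

    /** Formats every code point of s as U+XXXX, separated by single spaces. */
    static String dump(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    /**
     * Walks both strings in order and returns the code points of "expected"
     * that have no in-order match in "actual". This is a simple greedy scan,
     * not a general diff algorithm.
     */
    static List<String> missing(String expected, String actual) {
        int[] exp = expected.codePoints().toArray();
        int[] act = actual.codePoints().toArray();
        List<String> result = new ArrayList<>();
        int j = 0;
        for (int cp : exp) {
            if (j < act.length && act[j] == cp) {
                j++; // matched, advance in the extracted text
            } else {
                result.add(String.format("U+%04X", cp));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Assumed transcription of the two words quoted in the report.
        String expected = "\u0627\u0644\u0623\u0641\u063A\u0627\u0646\u064A";
        String extracted = "\u0627\u0644\u0641\u063A\u0627\u0646\u064A";

        System.out.println("expected:  " + dump(expected));
        System.out.println("extracted: " + dump(extracted));
        System.out.println("missing:   " + missing(expected, extracted)); // prints [U+0623]
    }
}
```

Running the same comparison on the text returned by the extraction step for the attached file would show exactly which code points go missing, which may help narrow down whether the encoding is involved.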
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)