[jira] [Updated] (PDFBOX-2382) Arabic compound words are displayed incorrectly

Ahmet (JIRA) Fri, 26 Sep 2014 00:48:07 -0700

[ https://issues.apache.org/jira/browse/PDFBOX-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Ahmet updated PDFBOX-2382:
--------------------------
    Description: 
Hi, I am using PDFBox (1.8.6) to convert Arabic PDF files (real text, not 
images of text) to HTML. PDFBox works really well in most cases; however, it 
has problems recognizing compound characters. I am attaching a sample PDF 
file. In it, for example, I get الفغاني but I should be getting الأفغاني. 
PDFBox misses the bit highlighted red (the أ). The same applies to: ا (PDFBox 
output) where I should be getting الله. Could this have to do with the 
encodings? I hope you can help me with this matter.

I know something similar has been reported before, and the conclusion was that 
the issue is due to how the PDF file is generated. Is there a way to generate 
a "correct" PDF file so that PDFBox extracts the text correctly? I created the 
attached file using OpenOffice 4.0. The original document is in MS Word format 
and was converted to PDF with OpenOffice. 

  was:
Hi, I am using PDFBox (1.8.6) to convert Arabic PDF files (real text, not 
images of text) to HTML. PDFBox works really well in most cases; however, it 
has problems recognizing compound characters. I am attaching a sample PDF 
file. In it, for example, I get الفغاني but I should be getting الأفغاني. 
PDFBox misses the bit highlighted red (the أ). The same applies to: ا (PDFBox 
output) where I should be getting الله. Could this have to do with the 
encodings? I hope you can help me with this matter.

I know something similar has been reported before, and the conclusion was that 
the issue is due to how the PDF file is generated. Is there a way to generate 
a "correct" PDF file so that PDFBox extracts the text correctly? I created the 
attached file using OpenOffice 4.0. The original document is in MS Word format 
and was converted with OpenOffice. 


> Arabic compound words are displayed incorrectly
> -----------------------------------------------
>
>                 Key: PDFBOX-2382
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2382
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.8.6
>         Environment: Windows 7, NetBeans 8.0, Java 8
>            Reporter: Ahmet
>         Attachments: arabicDoc2.doc, arabicDoc2.pdf
>
>
> Hi, I am using PDFBox (1.8.6) to convert Arabic PDF files (real text, not 
> images of text) to HTML. PDFBox works really well in most cases; however, it 
> has problems recognizing compound characters. I am attaching a sample PDF 
> file. In it, for example, I get الفغاني but I should be getting الأفغاني. 
> PDFBox misses the bit highlighted red (the أ). The same applies to: ا 
> (PDFBox output) where I should be getting الله. Could this have to do with 
> the encodings? I hope you can help me with this matter.
> I know something similar has been reported before, and the conclusion was 
> that the issue is due to how the PDF file is generated. Is there a way to 
> generate a "correct" PDF file so that PDFBox extracts the text correctly? I 
> created the attached file using OpenOffice 4.0. The original document is in 
> MS Word format and was converted to PDF with OpenOffice. 
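
The difference between the expected and the extracted strings can be made 
concrete by comparing them code point by code point. The following is a 
minimal Java sketch that uses only the two string pairs quoted in the report. 
The class and method names (MissingCodePoints, missing) are hypothetical and 
not part of PDFBox; the code does not call PDFBox at all, it only lists which 
code points of the expected text are absent from the extracted text.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of PDFBox.
class MissingCodePoints {

    /**
     * Walks `expected` from left to right and matches it against `actual`
     * as a subsequence. Every expected code point that cannot be matched
     * at the current position of `actual` is reported as "U+XXXX".
     */
    static List<String> missing(String expected, String actual) {
        List<String> out = new ArrayList<>();
        int[] exp = expected.codePoints().toArray();
        int[] act = actual.codePoints().toArray();
        int j = 0; // current position in the extracted text
        for (int cp : exp) {
            if (j < act.length && act[j] == cp) {
                j++; // matched, advance in the extracted text
            } else {
                out.add(String.format("U+%04X", cp));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // First pair from the report: expected vs. extracted word.
        String expected1 = "\u0627\u0644\u0623\u0641\u063A\u0627\u0646\u064A";
        String extracted1 = "\u0627\u0644\u0641\u063A\u0627\u0646\u064A";
        System.out.println(missing(expected1, extracted1));

        // Second pair from the report: only the first letter was extracted.
        String expected2 = "\u0627\u0644\u0644\u0647";
        String extracted2 = "\u0627";
        System.out.println(missing(expected2, extracted2));
    }
}
```

For the first pair this prints [U+0623] (ARABIC LETTER ALEF WITH HAMZA ABOVE), 
and for the second pair [U+0644, U+0644, U+0647]. The strings are written as 
\u escapes so the result does not depend on the source file encoding.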



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
