[ 
https://issues.apache.org/jira/browse/PDFBOX-377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13419856#comment-13419856
 ] 

amin bouja commented on PDFBOX-377:
-----------------------------------

Hi everyone,
I am having trouble applying this patch to my project, which extracts Arabic 
content from PDF files.
PS: I am using Eclipse. I tried right-clicking the project --> Team --> Apply 
Patch --> and chose PDFTextStripper.diff, but it doesn't work.
Please help.
                
> Incorrect direction of extracted Arabic Text
> --------------------------------------------
>
>                 Key: PDFBOX-377
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-377
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 0.8.0-incubator
>            Reporter: Brian Carrier
>            Assignee: Brian Carrier
>             Fix For: 0.8.0-incubator
>
>         Attachments: PDFTextStripper.diff, hello3.pdf, reorder-patch.zip
>
>
> Arabic text (and text in other right-to-left languages) is stored in 
> presentation order in PDF files, which is the opposite of the logical order 
> in which Arabic text is typically stored. In logical order, the first byte is 
> for the right-most character, but the output of PDFBox always has the first 
> byte being the left-most character. 
> Further, PDF files typically store the presentation forms of Arabic characters 
> instead of the more general forms: for example, U+FB50 instead of U+0671. 
> Presentation forms are not supposed to appear in logically ordered text, but 
> PDFBox does not normalize them out. 
> The attached patch solves both of these problems using the ICU4J library 
> (http://www.icu-project.org/).  It identifies the dominant text direction of 
> each page and reverses the order of each line (only if any right to left text 
> exists).  It then normalizes the text to remove the presentation forms. 
> An example file is attached.  Without the patch, the following is 
> (incorrectly) produced:
> Hello ﺪﻤﺤﻣ World. 
> With the patch, the following is (correctly) produced:
> Hello محمد World. 
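
To illustrate the two steps described in the issue (reordering right-to-left 
text and normalizing presentation forms), here is a minimal sketch. It is not 
the attached patch and does not use ICU4J: it assumes the JDK's own 
java.text.Normalizer and Character.getDirectionality are enough for a simple 
case, and the class name RtlRunFixer and method toLogicalOrder are hypothetical. 
It reverses each unbroken run of right-to-left characters rather than 
implementing the full Unicode bidirectional algorithm, so multi-word Arabic 
phrases and embedded digits are not handled.

```java
import java.text.Normalizer;

// Hypothetical helper, not part of PDFBox or of the attached patch.
final class RtlRunFixer {

    // True if the code point has strong right-to-left directionality
    // (Hebrew, Arabic, and the Arabic presentation-form blocks).
    static boolean isRtl(int cp) {
        byte d = Character.getDirectionality(cp);
        return d == Character.DIRECTIONALITY_RIGHT_TO_LEFT
            || d == Character.DIRECTIONALITY_RIGHT_TO_LEFT_ARABIC;
    }

    // Reverses every unbroken run of right-to-left characters, then applies
    // NFKC normalization, which maps Arabic presentation forms such as
    // U+FB50 to their general forms such as U+0671.
    static String toLogicalOrder(String visual) {
        StringBuilder out = new StringBuilder();
        StringBuilder run = new StringBuilder();
        int i = 0;
        while (i < visual.length()) {
            int cp = visual.codePointAt(i);
            if (isRtl(cp)) {
                run.appendCodePoint(cp);
            } else {
                // A non-RTL character ends the current run: flush it reversed.
                out.append(run.reverse());
                run.setLength(0);
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        out.append(run.reverse()); // flush a run that ends the line
        return Normalizer.normalize(out, Normalizer.Form.NFKC);
    }
}
```

Under these assumptions, calling RtlRunFixer.toLogicalOrder on the unpatched 
output "Hello ﺪﻤﺤﻣ World." should give "Hello محمد World.", matching the 
patched output quoted above. Note that NFKC also folds other compatibility 
characters (ligatures, full-width forms), which may or may not be wanted in 
extracted text.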

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

