[ 
https://issues.apache.org/jira/browse/PDFBOX-377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brian Carrier updated PDFBOX-377:
---------------------------------

    Attachment: reorder-patch.zip

Updated patch (with ICU jar file)

> Incorrect direction of extracted Arabic Text
> --------------------------------------------
>
>                 Key: PDFBOX-377
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-377
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 0.8.0-incubator
>            Reporter: Brian Carrier
>         Attachments: hello3.pdf, PDFTextStripper.diff, reorder-patch.zip
>
>
> Arabic text (and text in other right-to-left languages) is stored in PDF 
> files in presentation (visual) order, which is the opposite of the logical 
> order in which Arabic text is typically stored. In logical order the first 
> byte belongs to the right-most character, but in the output of PDFBox the 
> first byte always belongs to the left-most character. 
> Further, PDF files typically store the presentation forms of Arabic 
> characters instead of the more general forms, for example U+FB50 instead of 
> U+0671. Presentation forms are not supposed to appear in logically ordered 
> text, but PDFBox does not normalize them out. 
> The attached patch solves both of these problems using the ICU4J library 
> (http://www.icu-project.org/).  It identifies the dominant text direction of 
> each page and reverses the order of each line (only if any right-to-left text 
> exists).  It then normalizes the text to remove the presentation forms. 
> An example file is attached.  Without the patch, the following is 
> (incorrectly) produced:
> Hello ﺪﻤﺤﻣ World. 
> With the patch, the following is (correctly) produced:
> Hello محمد World. 
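
For readers without the attachment, the two steps described in the issue (reordering, then normalizing presentation forms) can be illustrated roughly as follows. This is only a sketch and not the code from reorder-patch.zip: it assumes the JDK's java.text.Normalizer and Character.getDirectionality in place of ICU4J, the class and method names (VisualToLogical, toLogical) are hypothetical, and it reverses each contiguous run of right-to-left characters rather than applying the full Unicode Bidirectional Algorithm.

```java
import java.text.Normalizer;

// Hypothetical helper, not part of PDFBox or of the attached patch.
final class VisualToLogical {

    // True for characters with strong right-to-left directionality
    // (Hebrew-style R or Arabic-style AL).
    static boolean isRtl(char c) {
        byte d = Character.getDirectionality(c);
        return d == Character.DIRECTIONALITY_RIGHT_TO_LEFT
            || d == Character.DIRECTIONALITY_RIGHT_TO_LEFT_ARABIC;
    }

    // Converts one visually ordered line to an approximation of logical order.
    static String toLogical(String visualLine) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < visualLine.length()) {
            int start = i;
            if (isRtl(visualLine.charAt(i))) {
                // Collect a run of RTL characters and reverse it.
                while (i < visualLine.length() && isRtl(visualLine.charAt(i))) {
                    i++;
                }
                out.append(new StringBuilder(visualLine.substring(start, i)).reverse());
            } else {
                // Copy left-to-right and neutral characters unchanged.
                while (i < visualLine.length() && !isRtl(visualLine.charAt(i))) {
                    i++;
                }
                out.append(visualLine, start, i);
            }
        }
        // NFKC maps Arabic presentation forms (e.g. U+FB50) to their
        // base letters (e.g. U+0671).
        return Normalizer.normalize(out, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // The incorrectly extracted line from the issue, written with escapes.
        String visual = "Hello \uFEAA\uFEE4\uFEA4\uFEE3 World.";
        System.out.println(toLogical(visual));
    }
}
```

Applied to the line from hello3.pdf quoted above, this should turn the presentation-form word into the base letters U+0645 U+062D U+0645 U+062F in logical order. Two limitations are worth noting: because spaces are neutral characters, a multi-word Arabic phrase would have each word reversed but the word order left untouched, and NFKC also rewrites unrelated compatibility characters such as Latin ligatures. A complete solution would need a real bidi implementation, which is presumably why the patch relies on ICU4J.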

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
