[jira] Resolved: (PDFBOX-377) Incorrect direction of extracted Arabic Text

Brian Carrier (JIRA) Tue, 13 Jan 2009 11:20:49 -0800

     [ 
https://issues.apache.org/jira/browse/PDFBOX-377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Brian Carrier resolved PDFBOX-377.
----------------------------------

       Resolution: Fixed
    Fix Version/s: 0.8.0-incubator
         Assignee: Brian Carrier

Patch checked into trunk revision 734151.

> Incorrect direction of extracted Arabic Text
> --------------------------------------------
>
>                 Key: PDFBOX-377
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-377
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 0.8.0-incubator
>            Reporter: Brian Carrier
>            Assignee: Brian Carrier
>             Fix For: 0.8.0-incubator
>
>         Attachments: hello3.pdf, PDFTextStripper.diff, reorder-patch.zip
>
>
> Arabic text (and text in other right-to-left languages) is stored in 
> presentation (visual) order in PDF files, which is the opposite of the 
> logical order in which Arabic text is typically stored. In logical order the 
> first byte belongs to the right-most character, but in the output of PDFBox 
> the first byte always belongs to the left-most character. 
> Further, PDF files typically store the presentation forms of Arabic 
> characters instead of the more general forms, for example U+FB50 instead of 
> U+0671. Presentation forms are not supposed to appear in logically ordered 
> text, but PDFBox does not normalize them out. 
> The attached patch solves both of these problems using the ICU4J library 
> (http://www.icu-project.org/).  It identifies the dominant text direction of 
> each page and reverses the order of each line (only if the line contains any 
> right-to-left text).  It then normalizes the text to remove the presentation 
> forms. 
> An example file is attached.  Without the patch, the following is 
> (incorrectly) produced:
> Hello ﺪﻤﺤﻣ World. 
> With the patch, the following is (correctly) produced:
> Hello محمد World. 
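
The two steps described above (reordering right-to-left text, then
normalizing presentation forms) can be illustrated with a minimal sketch.
This is not the code from the attached patch: the patch uses ICU4J and
works from the dominant direction of each page, whereas this sketch
assumes only the JDK (java.text.Normalizer and
Character.getDirectionality) and simply reverses each contiguous run of
strong right-to-left characters. The class and method names
(PresentationFormFixer, toLogical) are hypothetical. Because spaces are
direction-neutral, this only handles single-word runs; real text needs the
full Unicode Bidirectional Algorithm. NFKC also folds other compatibility
characters (Latin ligatures, for instance), which may or may not be wanted.

```java
import java.text.Normalizer;

// Hypothetical helper, for illustration only (not part of PDFBox).
class PresentationFormFixer {

    // True for characters whose bidi class is strong right-to-left (R or AL).
    static boolean isRtl(char c) {
        byte d = Character.getDirectionality(c);
        return d == Character.DIRECTIONALITY_RIGHT_TO_LEFT
            || d == Character.DIRECTIONALITY_RIGHT_TO_LEFT_ARABIC;
    }

    // Converts a visually ordered line to logical order, then removes
    // presentation forms via NFKC compatibility normalization.
    static String toLogical(String visual) {
        StringBuilder out = new StringBuilder(visual.length());
        int i = 0;
        while (i < visual.length()) {
            int start = i;
            if (isRtl(visual.charAt(i))) {
                // Collect one right-to-left run and append it reversed.
                while (i < visual.length() && isRtl(visual.charAt(i))) {
                    i++;
                }
                out.append(new StringBuilder(visual.substring(start, i)).reverse());
            } else {
                // Everything else (Latin text, spaces, punctuation) is copied as-is.
                while (i < visual.length() && !isRtl(visual.charAt(i))) {
                    i++;
                }
                out.append(visual, start, i);
            }
        }
        // NFKC maps e.g. U+FB50 to U+0671 and U+FEAA to U+062F.
        return Normalizer.normalize(out.toString(), Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // The example from the issue: the Arabic word in visual order,
        // encoded with presentation forms.
        String visual = "Hello \uFEAA\uFEE4\uFEA4\uFEE3 World.";
        System.out.println(toLogical(visual));
    }
}
```

With the input from the example above, the four presentation-form
characters are reversed and mapped to U+0645 U+062D U+0645 U+062F, which is
the logical-order spelling shown in the corrected output.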

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
