[jira] Commented: (PDFBOX-377) Incorrect direction of extracted Arabic Text

Jukka Zitting (JIRA) Mon, 17 Nov 2008 19:08:47 -0800

    [ 
https://issues.apache.org/jira/browse/PDFBOX-377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12648468#action_12648468
 ]


Jukka Zitting commented on PDFBOX-377:
--------------------------------------

Is there any reasonable way to achieve this without the ICU4J dependency?

There's nothing wrong with ICU4J (it's a solid piece of work with a nice 
license), but it's a pretty large library, and having it as a mandatory 
dependency might be troublesome for some users. With this patch the 
PDFTextStripper class would not even load without ICU4J on the classpath.
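
To make the question concrete, here is a minimal JDK-only sketch of the two steps the issue describes (reversing right-to-left text and normalizing presentation forms). It is not the attached patch and not PDFBox code. The class name RtlSketch and the method names are hypothetical. It assumes that reversing each unbroken run of right-to-left characters is good enough, which it is not for multi-word Arabic phrases or for text with combining marks. It also assumes NFKC normalization is acceptable, even though NFKC folds other compatibility characters (ligatures, full-width forms) as well.

```java
import java.text.Normalizer;

// Hypothetical helper, for illustration only; not part of PDFBox or the patch.
final class RtlSketch {

    // True for characters whose Unicode bidi class is R or AL.
    static boolean isRtl(char c) {
        byte d = Character.getDirectionality(c);
        return d == Character.DIRECTIONALITY_RIGHT_TO_LEFT
            || d == Character.DIRECTIONALITY_RIGHT_TO_LEFT_ARABIC;
    }

    // Reverses each maximal run of RTL characters, then applies NFKC so that
    // Arabic presentation forms (e.g. U+FB50) become base letters (U+0671).
    static String toLogical(String visual) {
        StringBuilder out = new StringBuilder(visual.length());
        int i = 0;
        while (i < visual.length()) {
            if (isRtl(visual.charAt(i))) {
                int start = i;
                while (i < visual.length() && isRtl(visual.charAt(i))) {
                    i++;
                }
                // Visual order stores the right-most character last; flip the run.
                out.append(new StringBuilder(visual.substring(start, i)).reverse());
            } else {
                out.append(visual.charAt(i));
                i++;
            }
        }
        return Normalizer.normalize(out.toString(), Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // Presentation forms in visual order, as in the issue's example line.
        String extracted = "Hello \uFEAA\uFEE4\uFEA4\uFEE3 World.";
        System.out.println(toLogical(extracted));
    }
}
```

A real implementation would need proper bidirectional reordering per line, which is what the ICU4J-based patch delegates to the library. The sketch only shows that the normalization step, at least, is available from java.text.Normalizer without an extra dependency.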


> Incorrect direction of extracted Arabic Text
> --------------------------------------------
>
>                 Key: PDFBOX-377
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-377
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 0.8.0-incubator
>            Reporter: Brian Carrier
>         Attachments: hello3.pdf, PDFTextStripper.diff
>
>
> Arabic text (and text in other right-to-left languages) is stored in 
> presentation format in PDF files, which is the opposite of the logical order 
> in which Arabic text is typically stored. Arabic text is typically stored such 
> that the first byte is for the right-most character, but the output of PDFBox 
> always has the first byte being the left-most character. 
> Further, PDF files typically store the presentation form of Arabic characters 
> instead of the more general form. For example, U+FB50 instead of U+0671. The 
> presentation form is not supposed to be stored in logically ordered text, but 
> PDFBox does not normalize it out. 
> The attached patch solves both of these problems using the ICU4J library 
> (http://www.icu-project.org/).  It identifies the dominant text direction of 
> each page and reverses the order of each line (only if any right to left text 
> exists).  It then normalizes the text to remove the presentation forms. 
> An example file is attached.  Without the patch, the following is 
> (incorrectly) produced:
> Hello ﺪﻤﺤﻣ World. 
> With the patch, the following is (correctly) produced:
> Hello محمد World. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
