[ https://issues.apache.org/jira/browse/PDFBOX-377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12648468#action_12648468 ]
Jukka Zitting commented on PDFBOX-377:
--------------------------------------

Is there any reasonable way to achieve this without the ICU4J dependency?

There's nothing wrong with ICU4J (it's a solid piece of work with a nice license), but it's a pretty large library, and having it as a mandatory dependency (with this patch the PDFTextStripper class would not even load without ICU4J in the classpath) might be troublesome for some users.

> Incorrect direction of extracted Arabic Text
> --------------------------------------------
>
>                 Key: PDFBOX-377
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-377
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 0.8.0-incubator
>            Reporter: Brian Carrier
>         Attachments: hello3.pdf, PDFTextStripper.diff
>
>
> Arabic text (and other right to left languages) is stored in presentation
> format in PDF files, which is the opposite of the logical order in which
> Arabic text is typically stored. Arabic text is typically stored such that
> the first byte is for the right-most character, but the output of PDFBox has
> the first byte always being the left-most character.
> Further, PDF files typically store the presentation form of Arabic characters
> instead of the more general form. For example, U+FB50 instead of U+0671. The
> presentation form is not supposed to be stored in the logical form, but
> PDFBox does not normalize them out.
> The attached patch solves both of these problems using the ICU4J library
> (http://www.icu-project.org/). It identifies the dominant text direction of
> each page and reverses the order of each line (only if any right to left text
> exists). It then normalizes the text to remove the presentation forms.
> An example file is attached. Without the patch, the following is
> (incorrectly) produced:
> Hello ﺪﻤﺤﻣ World.
> With the patch, the following is (correctly) produced:
> Hello محمد World.

-- 
This message is automatically generated by JIRA.
- You can reply to this email to add a comment to the issue online.
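
For illustration only, here is a rough sketch of the two steps the issue describes (reverse right-to-left text, then normalize presentation forms) using nothing but JDK classes. It is not the attached patch and not a full implementation of the Unicode Bidirectional Algorithm: it assumes each line arrives in left-to-right visual order and simply reverses every maximal run of strong right-to-left characters, so spaces, digits, and combining marks inside Arabic text are not handled. The class and method names (`RtlSketch`, `toLogicalOrder`) are hypothetical. `java.text.Normalizer` is part of the JDK since Java 6.

```java
import java.text.Normalizer;

// Hypothetical helper, not part of PDFBox or of the attached patch.
class RtlSketch {

    // True for characters with strong right-to-left directionality
    // (Hebrew-style R or Arabic-style AL).
    static boolean isStrongRtl(char c) {
        byte d = Character.getDirectionality(c);
        return d == Character.DIRECTIONALITY_RIGHT_TO_LEFT
            || d == Character.DIRECTIONALITY_RIGHT_TO_LEFT_ARABIC;
    }

    // Assumes the input line is in visual (left-to-right) order.
    static String toLogicalOrder(String visualLine) {
        StringBuilder out = new StringBuilder(visualLine.length());
        int i = 0;
        while (i < visualLine.length()) {
            if (!isStrongRtl(visualLine.charAt(i))) {
                out.append(visualLine.charAt(i++));
                continue;
            }
            // Find the end of this run of strong RTL characters and reverse it.
            int end = i;
            while (end < visualLine.length() && isStrongRtl(visualLine.charAt(end))) {
                end++;
            }
            out.append(new StringBuilder(visualLine.substring(i, end)).reverse());
            i = end;
        }
        // Normalize only after reversing, so ligatures that expand to several
        // letters come out in logical order. NFKC maps Arabic presentation
        // forms (e.g. U+FB50) to their base letters (U+0671).
        return Normalizer.normalize(out.toString(), Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // The example from the issue, written with escapes for the
        // presentation-form characters.
        System.out.println(toLogicalOrder("Hello \uFEAA\uFEE4\uFEA4\uFEE3 World."));
    }
}
```

Note that NFKC also folds other compatibility characters (Latin ligatures, full-width forms), which may or may not be wanted in extracted text.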