[ https://issues.apache.org/jira/browse/PDFBOX-377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13419856#comment-13419856 ]
amin bouja commented on PDFBOX-377:
-----------------------------------

Hi everyone,

I am having problems applying this patch to my project, which extracts Arabic content from PDF files.

PS: I am using Eclipse. I tried: right-click on the project --> Team --> Apply Patch --> and chose PDFTextStripper.diff, but it doesn't work.

Please help.

> Incorrect direction of extracted Arabic Text
> --------------------------------------------
>
>                 Key: PDFBOX-377
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-377
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>   Affects Versions: 0.8.0-incubator
>            Reporter: Brian Carrier
>            Assignee: Brian Carrier
>             Fix For: 0.8.0-incubator
>
>         Attachments: PDFTextStripper.diff, hello3.pdf, reorder-patch.zip
>
>
> Arabic text (and text in other right-to-left languages) is stored in
> presentation order in PDF files, which is the opposite of the logical order
> in which Arabic text is typically stored. Arabic text is typically stored
> such that the first byte is for the right-most character, but the output of
> PDFBox has the first byte always being the left-most character.
> Further, PDF files typically store the presentation form of Arabic characters
> instead of the more general form. For example, U+FB50 instead of U+0671. The
> presentation form is not supposed to be stored in the logical form, but
> PDFBox does not normalize them out.
> The attached patch solves both of these problems using the ICU4J library
> (http://www.icu-project.org/). It identifies the dominant text direction of
> each page and reverses the order of each line (only if any right-to-left text
> exists). It then normalizes the text to remove the presentation forms.
> An example file is attached. Without the patch, the following is
> (incorrectly) produced:
> Hello ﺪﻤﺤﻣ World.
> With the patch, the following is (correctly) produced:
> Hello محمد World.
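
For readers who cannot apply the patch, the two steps the issue description names (reordering right-to-left text and normalizing away the presentation forms) can be illustrated with a minimal sketch. This is not the code from PDFTextStripper.diff and it does not use ICU4J. It assumes only the JDK's `java.text.Normalizer` and `Character.getDirectionality`, and the class name `VisualToLogical` and the method `toLogical` are hypothetical names chosen for this illustration. It reverses each unbroken run of strongly right-to-left characters and applies NFKC normalization to that run, so it handles a single embedded Arabic word but not multi-word phrases, combining marks, or the page-level direction detection that the description mentions.

```java
import java.text.Normalizer;

final class VisualToLogical {

    // True for characters with a strong right-to-left bidi class
    // (Hebrew, Arabic, and the Arabic presentation forms).
    static boolean isRtl(char c) {
        byte d = Character.getDirectionality(c);
        return d == Character.DIRECTIONALITY_RIGHT_TO_LEFT
            || d == Character.DIRECTIONALITY_RIGHT_TO_LEFT_ARABIC;
    }

    // Hypothetical helper: converts one extracted line from visual order
    // to logical order, one right-to-left run at a time.
    static String toLogical(String line) {
        StringBuilder out = new StringBuilder(line.length());
        int i = 0;
        while (i < line.length()) {
            if (!isRtl(line.charAt(i))) {
                // Left-to-right and neutral characters are copied unchanged.
                out.append(line.charAt(i++));
                continue;
            }
            int start = i;
            while (i < line.length() && isRtl(line.charAt(i))) {
                i++;
            }
            // Step 1: reverse the run so the first character is the one read first.
            String run = new StringBuilder(line.substring(start, i)).reverse().toString();
            // Step 2: NFKC maps presentation forms (e.g. U+FB50) to their
            // general forms (e.g. U+0671).
            out.append(Normalizer.normalize(run, Normalizer.Form.NFKC));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Presentation forms in visual order, as in the "without the patch" output.
        String visual = "Hello \uFEAA\uFEE4\uFEA4\uFEE3 World.";
        System.out.println(toLogical(visual));
    }
}
```

Normalization is applied only to the reversed runs so that the left-to-right parts of the line are left exactly as extracted. A production implementation would more likely rely on a full bidirectional algorithm, such as the one ICU4J provides, rather than this run-by-run reversal.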