[ https://issues.apache.org/jira/browse/PDFBOX-377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Brian Carrier resolved PDFBOX-377.
----------------------------------

       Resolution: Fixed
    Fix Version/s: 0.8.0-incubator
         Assignee: Brian Carrier

Patch checked into trunk revision 734151.

> Incorrect direction of extracted Arabic Text
> --------------------------------------------
>
>                 Key: PDFBOX-377
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-377
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 0.8.0-incubator
>            Reporter: Brian Carrier
>            Assignee: Brian Carrier
>             Fix For: 0.8.0-incubator
>
>         Attachments: hello3.pdf, PDFTextStripper.diff, reorder-patch.zip
>
>
> Arabic text (and text in other right-to-left languages) is stored in
> presentation order in PDF files, which is the opposite of the logical order
> in which Arabic text is typically stored. In logical order the first byte
> belongs to the right-most character, but in the output of PDFBox the first
> byte always belongs to the left-most character.
> Further, PDF files typically store the presentation forms of Arabic
> characters instead of the more general forms, for example U+FB50 instead of
> U+0671. Presentation forms are not supposed to appear in logically ordered
> text, but PDFBox does not normalize them away.
> The attached patch solves both of these problems using the ICU4J library
> (http://www.icu-project.org/). It identifies the dominant text direction of
> each page and reverses the order of each line (only if any right-to-left
> text exists). It then normalizes the text to remove the presentation forms.
> An example file is attached. Without the patch, the following is
> (incorrectly) produced:
> Hello ﺪﻤﺤﻣ World.
> With the patch, the following is (correctly) produced:
> Hello محمد World.
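
The two steps named in the description (put right-to-left text into logical order, then normalize away the presentation forms) can be sketched with JDK classes alone. This is not the attached patch: the patch uses ICU4J and determines the dominant direction per page, whereas the sketch below substitutes java.text.Normalizer (NFKC) and a simplified reversal of right-to-left runs within a single line. The class name RtlTextSketch and its methods are hypothetical, chosen only for illustration.

```java
import java.text.Normalizer;

// Hypothetical illustration only; not the code from PDFTextStripper.diff.
final class RtlTextSketch {

    /** True for characters with strong right-to-left directionality (Hebrew, Arabic). */
    static boolean isRtl(char c) {
        byte d = Character.getDirectionality(c);
        return d == Character.DIRECTIONALITY_RIGHT_TO_LEFT
            || d == Character.DIRECTIONALITY_RIGHT_TO_LEFT_ARABIC;
    }

    /** True for characters that end a right-to-left run: strong LTR letters and digits. */
    static boolean endsRtlRun(char c) {
        return Character.getDirectionality(c) == Character.DIRECTIONALITY_LEFT_TO_RIGHT
            || Character.isDigit(c);
    }

    /**
     * Converts one line from visual (presentation) order to logical order.
     * Each run that starts and ends with an RTL character is reversed in place;
     * spaces and punctuation between RTL words stay inside the run so that
     * word order is reversed as well. Normalization happens after the reversal
     * so that ligatures expanding to several characters keep their logical order.
     */
    static String visualToLogical(String line) {
        StringBuilder out = new StringBuilder(line);
        int i = 0;
        while (i < out.length()) {
            if (!isRtl(out.charAt(i))) {
                i++;
                continue;
            }
            // Extend the run up to the last RTL character before any LTR letter or digit.
            int last = i;
            for (int k = i + 1; k < out.length() && !endsRtlRun(out.charAt(k)); k++) {
                if (isRtl(out.charAt(k))) {
                    last = k;
                }
            }
            String run = out.substring(i, last + 1);
            out.replace(i, last + 1, new StringBuilder(run).reverse().toString());
            i = last + 1;
        }
        // NFKC maps Arabic presentation forms to base letters (e.g. U+FB50 to U+0671).
        return Normalizer.normalize(out, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // The visual-order sample from the issue, written with escapes.
        String visual = "Hello \uFEAA\uFEE4\uFEA4\uFEE3 World.";
        System.out.println(visualToLogical(visual));
    }
}
```

For the sample line from the issue, the sketch reverses the four presentation-form characters and maps them to U+0645 U+062D U+0645 U+062F, which gives the "Hello محمد World." result quoted above. It deliberately ignores cases a full bidirectional algorithm handles, such as numbers embedded in right-to-left text, mirrored brackets, and pages whose dominant direction is right-to-left.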