[ https://issues.apache.org/jira/browse/PDFBOX-377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653529#action_12653529 ]
Brian Carrier commented on PDFBOX-377:
--------------------------------------

I have redone this patch so that it is optional at runtime and takes into account the code changes made since the original patch. ICU4J is needed to build PDFBox, but the code tests for the relevant classes at runtime to determine whether the ICU4J components should be used. Updated ant and maven files are included in the diff.

While working on this patch, I realized that PDFTextStripper (and its subclasses) were not consistent in how they wrote to the output (some called output directly and others used wrapper functions). I made this more consistent and added some more wrappers (such as writeString).

I then realized that some of the functions were not consistently named. For example, processLineSeparator() was not really processing a line separator in the PDF file (in the way that processPage() processes a page). It prints a line separator, so I renamed it to writeLineSeparator() and deprecated the original. I found a few other inconsistently named functions, renamed them, and kept deprecated wrappers for backwards compatibility. For example:
- PDFStreamEngine.showCharacter() was renamed to processTextPosition() because it doesn't always show a character and it belongs to the hierarchy of processXXX() functions that includes processPages(), processPage(), etc.
- Similarly, PDFStreamEngine.showString() was renamed to processEncodedText() because a) it doesn't display anything and b) it takes encoded data as input (not a normal string).
- PDFTextStripper.flushText() was renamed to writePage() because it is the writing counterpart to processPage() and it operates at the page scale rather than the document scale.

I migrated these renames to the classes that use them to remove the deprecation warnings.

Three regression tests now fail, but in each case the new output is an improvement:
- 10101-AR.pdf now has more correct Arabic text in it. The result is better still if sorting is enabled, but the tests do not use sorting.
- The cweb.pdf and Garcia2004_thesis.pdf outputs are both better because the 'ff' ligature has been removed and replaced with "f" and "f".

> Incorrect direction of extracted Arabic Text
> --------------------------------------------
>
>                 Key: PDFBOX-377
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-377
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 0.8.0-incubator
>            Reporter: Brian Carrier
>         Attachments: hello3.pdf, PDFTextStripper.diff
>
>
> Arabic text (and other right to left languages) is stored in presentation
> format in PDF files, which is the opposite of the logical order in which
> Arabic text is typically stored. Arabic text is typically stored such that
> the first byte is for the right-most character, but the output of PDFBox has
> the first byte always being the left-most character.
> Further, PDF files typically store the presentation form of Arabic characters
> instead of the more general form. For example, U+FB50 instead of U+0671. The
> presentation form is not supposed to be stored in the logical form, but
> PDFBox does not normalize them out.
> The attached patch solves both of these problems using the ICU4J library
> (http://www.icu-project.org/). It identifies the dominant text direction of
> each page and reverses the order of each line (only if any right to left text
> exists). It then normalizes the text to remove the presentation forms.
> An example file is attached. Without the patch, the following is
> (incorrectly) produced:
> Hello ﺪﻤﺤﻣ World.
> With the patch, the following is (correctly) produced:
> Hello محمد World.
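
The two steps described in the issue (reorder right-to-left text, then normalize away presentation forms) can be illustrated with a small standalone sketch. This is not the code from the attached patch, which uses ICU4J. As an assumption made for illustration, it uses only the JDK: java.text.Normalizer with NFKC for the normalization step, and a deliberately naive reversal of each contiguous run of right-to-left characters in place of a real bidirectional algorithm. The class name RtlSketch and the methods isRtl and toLogicalOrder are hypothetical.

```java
import java.text.Normalizer;

// Hypothetical illustration only; not the PDFBOX-377 patch and not ICU4J.
class RtlSketch {

    // True for characters whose Unicode bidi class is R or AL.
    static boolean isRtl(char c) {
        byte d = Character.getDirectionality(c);
        return d == Character.DIRECTIONALITY_RIGHT_TO_LEFT
            || d == Character.DIRECTIONALITY_RIGHT_TO_LEFT_ARABIC;
    }

    // Naive simplification: reverse each contiguous run of RTL characters,
    // leave everything else in place, then apply NFKC so that Arabic
    // presentation forms (and ligatures such as 'ff') map to base characters.
    static String toLogicalOrder(String visual) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < visual.length()) {
            int start = i;
            if (isRtl(visual.charAt(i))) {
                while (i < visual.length() && isRtl(visual.charAt(i))) {
                    i++;
                }
                out.append(new StringBuilder(visual.substring(start, i)).reverse());
            } else {
                while (i < visual.length() && !isRtl(visual.charAt(i))) {
                    i++;
                }
                out.append(visual, start, i);
            }
        }
        return Normalizer.normalize(out, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // Presentation forms in visual (left-to-right) order, as in the issue's example.
        String visual = "Hello \uFEAA\uFEE4\uFEA4\uFEE3 World.";
        System.out.println(toLogicalOrder(visual)); // prints "Hello محمد World."
    }
}
```

The run-by-run reversal only handles a single right-to-left word surrounded by left-to-right text. Spaces split the runs, so a phrase of several Arabic words would keep its words in visual order, which is one reason a full bidirectional implementation such as the one in ICU4J is the better tool. NFKC also rewrites other compatibility characters, so it is broader than a targeted presentation-form mapping.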