[jira] Commented: (PDFBOX-377) Incorrect direction of extracted Arabic Text

Brian Carrier (JIRA) Thu, 04 Dec 2008 14:47:45 -0800

    [ https://issues.apache.org/jira/browse/PDFBOX-377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653529#action_12653529 ]


Brian Carrier commented on PDFBOX-377:
--------------------------------------

I have redone this patch so that it is optional at runtime and accounts for 
other code changes that have been made since the original patch. ICU4J is 
still needed to build PDFBox, but the code tests for the relevant classes at 
runtime to determine whether the ICU4J components should be used.  Updated ant 
and maven files are included in the diff.
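
To illustrate the runtime test described above, here is a minimal sketch of 
probing for an optional library with reflection. It is not the code from the 
patch: the class OptionalLibraryCheck and its method are hypothetical, and 
probing for com.ibm.icu.text.Bidi specifically is an assumption about which 
ICU4J class would be checked.

```java
/** Sketch only: detect an optional library at runtime via reflection. */
final class OptionalLibraryCheck {
    // A real ICU4J class name; that the patch probes this exact class is an assumption.
    static final String ICU4J_BIDI = "com.ibm.icu.text.Bidi";

    /** Returns true if the named class can be loaded, without initializing it. */
    static boolean isClassAvailable(String className) {
        try {
            Class.forName(className, false, OptionalLibraryCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // Library (or one of its dependencies) is missing from the classpath.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("ICU4J available: " + isClassAvailable(ICU4J_BIDI));
    }
}
```

Code that actually calls into ICU4J would then sit behind this check, so that 
the classes are only touched when they are present.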

While working on this patch, I realized that PDFTextStripper (and its 
subclasses) were not consistent in how they wrote to the output (some wrote to 
output directly and others used wrapper functions).  I made this more 
consistent and added some more wrappers (such as writeString). I then realized 
that some of the functions were not consistently named.  For example, 
processLineSeparator() was not really processing a line separator in the PDF 
file (the way processPage() processes a page). It prints a line separator, so 
I renamed it to writeLineSeparator() and deprecated the original. I found a 
few other inconsistently named functions and renamed them as well, keeping 
deprecated wrappers under the old names for backwards compatibility. 

For example:
- PDFStreamEngine.showCharacter() was renamed to processTextPosition() because 
it doesn't always show a character and it is in the hierarchy of processXXX() 
functions that include processPages(), processPage(), etc. 
- Similarly, PDFStreamEngine.showString() was renamed to processEncodedText() 
because a) it doesn't display anything and b) it takes encoded data as input 
(not a normal string). 
- PDFTextStripper.flushText() was renamed to writePage() because it is the 
writing counterpart to processPage() and it operates at the page scale, versus 
document scale. 

I migrated these renames to the classes that use them to remove the 
deprecation warnings. 
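
The rename-plus-deprecated-wrapper pattern described above can be sketched as 
follows. StripperSketch is a hypothetical stand-in, not the real 
PDFTextStripper, and it writes to a StringBuilder rather than a Writer to keep 
the example short; only the method names are taken from the comment.

```java
/** Hypothetical stand-in for PDFTextStripper; not the real class. */
class StripperSketch {
    // Assumption: a StringBuilder replaces the real output Writer for brevity.
    protected final StringBuilder output;

    StripperSketch(StringBuilder output) {
        this.output = output;
    }

    /** Wrapper so that subclasses never touch the output field directly. */
    protected void writeString(String text) {
        output.append(text);
    }

    /** New name: this method writes a separator, it does not process PDF content. */
    protected void writeLineSeparator() {
        writeString(System.lineSeparator());
    }

    /**
     * Old name, kept so existing subclasses and callers still compile.
     * @deprecated use {@link #writeLineSeparator()} instead
     */
    @Deprecated
    protected void processLineSeparator() {
        writeLineSeparator();
    }
}
```

Callers of the old name keep working and get a compiler warning that points 
them to the new one.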

There are three failures in the regression tests. All three are improvements. 
- 10101-AR.pdf now has more correct Arabic text in it.  The result is better 
still if sorting is enabled, but the tests do not use sorting.
- The cweb.pdf and Garcia2004_thesis.pdf outputs are both better because the 
'ff' ligature has been replaced with "f" and "f".
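
As a rough illustration of the ligature change, the sketch below expands the 
'ff' ligature (U+FB00) with the JDK's java.text.Normalizer in NFKC form. This 
is a swapped-in technique for demonstration: the patch uses ICU4J, and the 
class LigatureSketch is hypothetical.

```java
import java.text.Normalizer;

/** Sketch only: expand compatibility ligatures using the JDK normalizer. */
final class LigatureSketch {
    /** NFKC replaces compatibility characters such as U+FB00 with their plain equivalents. */
    static String expandLigatures(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // "e", the 'ff' ligature, then "ect"
        System.out.println(expandLigatures("e\uFB00ect"));
    }
}
```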


> Incorrect direction of extracted Arabic Text
> --------------------------------------------
>
>                 Key: PDFBOX-377
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-377
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 0.8.0-incubator
>            Reporter: Brian Carrier
>         Attachments: hello3.pdf, PDFTextStripper.diff
>
>
> Arabic text (and other right to left languages) is stored in presentation 
> format in PDF files, which is the opposite of the logical order in which 
> Arabic text is typically stored. Arabic text is typically stored such that 
> the first byte is for the right-most character, but the output of PDFBox has 
> the first byte always being the left-most character. 
> Further, PDF files typically store the presentation form of Arabic characters 
> instead of the more general form. For example, U+FB50 instead of U+0671. 
> Presentation forms are not supposed to appear in logically ordered text, but 
> PDFBox does not normalize them out. 
> The attached patch solves both of these problems using the ICU4J library 
> (http://www.icu-project.org/).  It identifies the dominant text direction of 
> each page and reverses the order of each line (only if any right to left text 
> exists).  It then normalizes the text to remove the presentation forms. 
> An example file is attached.  Without the patch, the following is 
> (incorrectly) produced:
> Hello ﺪﻤﺤﻣ World. 
> With the patch, the following is (correctly) produced:
> Hello محمد World. 
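
The two steps in the issue description (reorder, then normalize) can be 
sketched with the JDK alone. This is a deliberately simplified stand-in for 
what the patch does with ICU4J: instead of running the bidirectional algorithm 
over a whole line, it only reverses each unbroken run of right-to-left 
characters, so it is correct for a single Arabic word as in the example above 
but not for multi-word right-to-left phrases. The class RtlRunSketch and its 
methods are hypothetical.

```java
import java.text.Normalizer;

/** Sketch only: reverse right-to-left runs, then drop presentation forms. */
final class RtlRunSketch {
    static boolean isRtl(char c) {
        byte d = Character.getDirectionality(c);
        return d == Character.DIRECTIONALITY_RIGHT_TO_LEFT
            || d == Character.DIRECTIONALITY_RIGHT_TO_LEFT_ARABIC;
    }

    /** Converts visually ordered text to logical order, one RTL run at a time. */
    static String toLogicalOrder(String visual) {
        StringBuilder out = new StringBuilder(visual.length());
        int i = 0;
        while (i < visual.length()) {
            if (isRtl(visual.charAt(i))) {
                // Find the end of this run of RTL characters and reverse it.
                int end = i;
                while (end < visual.length() && isRtl(visual.charAt(end))) {
                    end++;
                }
                out.append(new StringBuilder(visual.substring(i, end)).reverse());
                i = end;
            } else {
                out.append(visual.charAt(i));
                i++;
            }
        }
        // NFKC maps Arabic presentation forms (e.g. U+FB50) to base letters (e.g. U+0671).
        return Normalizer.normalize(out, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // Presentation forms in visual order: dal final, meem medial, hah medial, meem initial.
        System.out.println(toLogicalOrder("Hello \uFEAA\uFEE4\uFEA4\uFEE3 World."));
    }
}
```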

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
