[ 
https://issues.apache.org/jira/browse/PDFBOX-2831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14592688#comment-14592688
 ] 

Glenn Adams commented on PDFBOX-2831:
-------------------------------------

To correctly handle Arabic and a number of other complex scripts, including 
complex uses of the Latin script, one needs to do the following (this 
description simplifies the actual process, e.g., it ignores shaping boundaries 
derived from bidi resolution or style):

* normalize Unicode input strings using one of the Unicode normalization forms;
* sub-divide input text into script runs (and/or use external script styling 
information);
* for each script run, perform GSUB and GPOS processing according to the 
applicable features and language bindings.
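
The first two steps above can be sketched with the JDK alone. This is a 
minimal illustration, not PDFBOX or FONTBOX code: the class ScriptRuns, the 
type Run, and both method names are hypothetical, NFC is just one possible 
choice of normalization form, and the rule for attaching COMMON/INHERITED 
characters to the surrounding run is a simplifying assumption. The GSUB/GPOS 
step is not shown.

```java
import java.lang.Character.UnicodeScript;
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only (not part of PDFBOX/FONTBOX).
final class ScriptRuns {

    /** A script run covering [start, end) of the normalized string. */
    public static final class Run {
        public final int start;
        public final int end;
        public final UnicodeScript script;

        Run(int start, int end, UnicodeScript script) {
            this.start = start;
            this.end = end;
            this.script = script;
        }
    }

    /** Step 1: normalize the input (NFC is an assumed choice of form). */
    public static String normalize(String input) {
        return Normalizer.normalize(input, Normalizer.Form.NFC);
    }

    /**
     * Step 2: split text into runs of a single script. Characters whose
     * script is COMMON, INHERITED or UNKNOWN (spaces, combining marks, ...)
     * never start a new run; they stay with the run they appear in.
     */
    public static List<Run> splitIntoScriptRuns(String text) {
        List<Run> runs = new ArrayList<>();
        int runStart = 0;
        UnicodeScript runScript = null; // no "real" script seen yet
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            UnicodeScript s = UnicodeScript.of(cp);
            boolean neutral = s == UnicodeScript.COMMON
                    || s == UnicodeScript.INHERITED
                    || s == UnicodeScript.UNKNOWN;
            if (!neutral) {
                if (runScript == null) {
                    runScript = s;
                } else if (s != runScript) {
                    // script changed: close the current run, open a new one
                    runs.add(new Run(runStart, i, runScript));
                    runStart = i;
                    runScript = s;
                }
            }
            i += Character.charCount(cp);
        }
        if (runStart < text.length()) {
            UnicodeScript last =
                    runScript == null ? UnicodeScript.COMMON : runScript;
            runs.add(new Run(runStart, text.length(), last));
        }
        return runs;
    }
}
```

Calling splitIntoScriptRuns(normalize(text)) yields the runs that would then 
be handed, one at a time, to GSUB/GPOS processing. A real implementation would 
also take bidi levels and external styling information into account.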

The output of this process is a sequence of glyph runs. Each glyph run 
consists of an array of glyph indices, a per-glyph origin adjustment and 
advance in both the x and y axes, and, typically, an association table that 
stores, for each glyph index, a character index into the original character 
string. Zero or more glyph indices may be associated with the same character 
index, and the order of character indices in the association table need not be 
linear.
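
One possible in-memory shape for such a glyph run is sketched below. The 
class GlyphRun and all of its field and method names are hypothetical and do 
not correspond to any existing FONTBOX type; the use of int values (e.g. font 
design units) for adjustments and advances is likewise an assumption.

```java
import java.util.stream.IntStream;

// Hypothetical data holder, for illustration only (not a FONTBOX type).
final class GlyphRun {
    final int[] glyphs;      // glyph indices, in output order
    final int[] xPlacement;  // per-glyph origin adjustment, x axis
    final int[] yPlacement;  // per-glyph origin adjustment, y axis
    final int[] xAdvance;    // per-glyph advance, x axis
    final int[] yAdvance;    // per-glyph advance, y axis
    final int[] charIndex;   // association table: glyph slot -> char index

    GlyphRun(int[] glyphs, int[] xPlacement, int[] yPlacement,
             int[] xAdvance, int[] yAdvance, int[] charIndex) {
        int n = glyphs.length;
        if (xPlacement.length != n || yPlacement.length != n
                || xAdvance.length != n || yAdvance.length != n
                || charIndex.length != n) {
            // All per-glyph arrays must stay the same length.
            throw new IllegalArgumentException(
                    "per-glyph arrays must all have length " + n);
        }
        this.glyphs = glyphs;
        this.xPlacement = xPlacement;
        this.yPlacement = yPlacement;
        this.xAdvance = xAdvance;
        this.yAdvance = yAdvance;
        this.charIndex = charIndex;
    }

    int glyphCount() {
        return glyphs.length;
    }

    /**
     * Returns the glyph slots associated with one character index. The
     * result may be empty (the character produced no glyph of its own,
     * e.g. it was absorbed into a ligature) or hold several slots.
     */
    int[] glyphsForChar(int ci) {
        return IntStream.range(0, charIndex.length)
                .filter(i -> charIndex[i] == ci)
                .toArray();
    }
}
```

Because the association is stored per glyph, reordering simply shows up as a 
non-monotonic charIndex array, for example {1, 0, 0}. Validating the array 
lengths once in the constructor keeps the per-glyph arrays from drifting out 
of step with each other.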

To support this in PDFBOX, support for the OpenType Advanced Typographic tables 
(GDEF/GSUB/GPOS) must be added to FONTBOX, and additional support is required 
in PDFBOX itself.

Due to a project I am currently working on, I expect to submit a patch that 
adds GDEF/GSUB/GPOS support to FONTBOX, probably in the next month or two. 
However, I have no plans at this time to work on the other aspects that would 
naturally go into PDFBOX outside of FONTBOX.


> ArrayIndexOutOfBoundsException in mergeDiacritic() on extraction of text with 
> diacritic text
> --------------------------------------------------------------------------------------------
>
>                 Key: PDFBOX-2831
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2831
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.0
>            Reporter: Andreas Meier
>            Priority: Minor
>         Attachments: PDFBOX-2831.pdf, chya31marked.jpg
>
>
> PDFBox may fail on extraction of text in method mergeDiacritic(TextPosition 
> diacritic):
> {code}
> Caused by: java.lang.ArrayIndexOutOfBoundsException: 1
>       at org.apache.pdfbox.text.TextPosition.mergeDiacritic(TextPosition.java:532)
>       at org.apache.pdfbox.text.PDFTextStripper.processTextPosition(PDFTextStripper.java:945)
>       at org.apache.pdfbox.text.PDFTextStreamEngine.showGlyph(PDFTextStreamEngine.java:229)
>       at org.apache.pdfbox.contentstream.PDFStreamEngine.showText(PDFStreamEngine.java:683)
>       at org.apache.pdfbox.contentstream.PDFStreamEngine.showTextStrings(PDFStreamEngine.java:593)
>       at org.apache.pdfbox.contentstream.operator.text.ShowTextAdjusted.process(ShowTextAdjusted.java:38)
>       at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:795)
>       at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:462)
>       at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:438)
>       at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:149)
>       at org.apache.pdfbox.text.PDFTextStreamEngine.processPage(PDFTextStreamEngine.java:117)
>       at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:369)
>       at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:305)
>       at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:249)
>       at org.apache.pdfbox.text.PDFTextStripper.getText(PDFTextStripper.java:210)
>       ... 8 more
> {code}
> The exception is thrown because the variable "unicode" contains two diacritic 
> signs (for example, Arabic shadda U+0651 and Arabic fatha U+064E, so the 
> unicode length is 2), while the array "widths" contains only one entry at 
> that time ([x.xxxxxxxx]).
> A temporary workaround could be to check the size of the array (this does not 
> address the actual problem, which is that the unicode and widths variables 
> drift apart):
> {code}
>     /**
>      * Merge a single character TextPosition into the current object. This is to be used only for
>      * cases where we have a diacritic that overlaps an existing TextPosition. In a graphical
>      * display, we could overlay them, but for text extraction we need to merge them. Use the
>      * contains() method to test if two objects overlap.
>      *
>      * @param diacritic TextPosition to merge into the current TextPosition.
>      */
>     public void mergeDiacritic(TextPosition diacritic)
>     {
>         if (diacritic.getUnicode().length() > 1)
>         {
>             return;
>         }
>         float diacXStart = diacritic.getXDirAdj();
>         float diacXEnd = diacXStart + diacritic.widths[0];
>         float currCharXStart = getXDirAdj();
>         int strLen = unicode.length();
>         boolean wasAdded = false;
>         for (int i = 0; i < strLen && !wasAdded; i++)
>         {
>             if (i <= (widths.length - 1))
>             {
>                 float currCharXEnd = currCharXStart + widths[i];
>                 // this is the case where there is an overlap of the diacritic character with the
>                 // current character and the previous character. If no previous character, just append
>                 // the diacritic after the current one
>                 if (diacXStart < currCharXStart && diacXEnd <= currCharXEnd)
>                 {
>                     if (i == 0)
>                     {
>                         insertDiacritic(i, diacritic);
>                     }
>                     else
>                     {
>                         float distanceOverlapping1 = diacXEnd - currCharXStart;
>                         float percentage1 = distanceOverlapping1/widths[i];
>                         float distanceOverlapping2 = currCharXStart - diacXStart;
>                         float percentage2 = distanceOverlapping2/widths[i - 1];
>                         if (percentage1 >= percentage2)
>                         {
>                             insertDiacritic(i, diacritic);
>                         }
>                         else
>                         {
>                             insertDiacritic(i - 1, diacritic);
>                         }
>                     }
>                     wasAdded = true;
>                 }
>                 // diacritic completely covers this character and therefore we assume that this is the
>                 // character the diacritic belongs to
>                 else if (diacXStart < currCharXStart && diacXEnd > currCharXEnd)
>                 {
>                     insertDiacritic(i, diacritic);
>                     wasAdded = true;
>                 }
>                 // otherwise, the diacritic modifies this character because it's completely
>                 // contained by the character width
>                 else if (diacXStart >= currCharXStart && diacXEnd <= currCharXEnd)
>                 {
>                     insertDiacritic(i, diacritic);
>                     wasAdded = true;
>                 }
>                 // last character in the TextPosition so we add diacritic to the end
>                 else if (diacXStart >= currCharXStart && diacXEnd > currCharXEnd && i == strLen - 1)
>                 {
>                     insertDiacritic(i, diacritic);
>                     wasAdded = true;
>                 }
>                 // couldn't find anything useful so we go to the next character in the TextPosition
>                 currCharXStart += widths[i];
>             } else {
>                 // problem: unicode length and widths size differ
>             }
>         }
>     }
> {code}
> This problem has only occurred with Arabic texts so far. Since there is no 
> evidence that it will occur only in Arabic text, I did not attach it to 
> another issue. Further investigation is needed.


