[ 
https://issues.apache.org/jira/browse/PDFBOX-2831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14591829#comment-14591829
 ] 

Andreas Meier commented on PDFBOX-2831:
---------------------------------------

The issue occurs at several positions in the file.
I marked one of them with a red dot on page 8, see the attached image.

In this case the variable unicode contains ya (U+064A) + noon (U+0646), while 
widths contains only one value. The merged diacritic in this case is fathah.

unicode[0] = ي
unicode[1] = م
unicode=مي
You can search for the location in the PDF using the given unicode string.
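
To illustrate the mismatch outside of PDFBox, here is a minimal sketch. The class 
WidthsMismatchDemo and both helper methods are hypothetical names and not PDFBox 
code, and the width value is made up. It only shows that a loop bounded by the 
length of unicode reads past the end of a widths array with a single entry.

```java
class WidthsMismatchDemo
{
    // Hypothetical helper: walks the widths the way a loop bounded by the
    // string length would, without checking the array size.
    static float sumWidthsUnchecked(String unicode, float[] widths)
    {
        float x = 0;
        for (int i = 0; i < unicode.length(); i++)
        {
            x += widths[i]; // fails as soon as i reaches widths.length
        }
        return x;
    }

    // Hypothetical helper: number of characters that actually have a width.
    static int coveredChars(String unicode, float[] widths)
    {
        return Math.min(unicode.length(), widths.length);
    }

    public static void main(String[] args)
    {
        String unicode = "\u064A\u0645"; // the two characters listed above
        float[] widths = { 5.0f };       // made-up value, only the length matters

        System.out.println("unicode length = " + unicode.length()
                + ", widths length = " + widths.length);
        try
        {
            sumWidthsUnchecked(unicode, widths);
        }
        catch (ArrayIndexOutOfBoundsException e)
        {
            System.out.println("caught: " + e);
        }
        System.out.println("characters with a width: " + coveredChars(unicode, widths));
    }
}
```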

My first guess was that the problem is caused by glyphs that carry two diacritics 
at the same time, because I had another test file where this was the case.

The new test file I posted proved that guess wrong.
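
One way to check a guess like this is to count the combining marks in the unicode 
string of a TextPosition. The following is a minimal sketch that uses only 
java.lang.Character. The class and method names are hypothetical, and it assumes 
that the diacritics of interest are classified as non-spacing marks (Mn), which 
is the case for shaddah U+0651 and fathah U+064E.

```java
class DiacriticCount
{
    // Counts the code points whose general category is Mn (non-spacing mark).
    static int countNonSpacingMarks(String s)
    {
        int count = 0;
        for (int i = 0; i < s.length(); )
        {
            int cp = s.codePointAt(i);
            if (Character.getType(cp) == Character.NON_SPACING_MARK)
            {
                count++;
            }
            i += Character.charCount(cp);
        }
        return count;
    }

    public static void main(String[] args)
    {
        // shaddah followed by fathah
        System.out.println(countNonSpacingMarks("\u0651\u064E"));
        // the two characters from the failing position
        System.out.println(countNonSpacingMarks("\u064A\u0645"));
    }
}
```

A string that fails with a count of 0 or 1 would speak against the 
"two diacritics" explanation.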

I think I need to dig into the code to understand what's happening.

> ArrayIndexOutOfBoundsException in mergeDiacritic() on extraction of text with 
> diacritic text
> --------------------------------------------------------------------------------------------
>
>                 Key: PDFBOX-2831
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2831
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.0
>            Reporter: Andreas Meier
>            Priority: Minor
>
> PDFBox may fail on extraction of text in method mergeDiacritic(TextPosition 
> diacritic):
> {code}
> Caused by: java.lang.ArrayIndexOutOfBoundsException: 1
>       at org.apache.pdfbox.text.TextPosition.mergeDiacritic(TextPosition.java:532)
>       at org.apache.pdfbox.text.PDFTextStripper.processTextPosition(PDFTextStripper.java:945)
>       at org.apache.pdfbox.text.PDFTextStreamEngine.showGlyph(PDFTextStreamEngine.java:229)
>       at org.apache.pdfbox.contentstream.PDFStreamEngine.showText(PDFStreamEngine.java:683)
>       at org.apache.pdfbox.contentstream.PDFStreamEngine.showTextStrings(PDFStreamEngine.java:593)
>       at org.apache.pdfbox.contentstream.operator.text.ShowTextAdjusted.process(ShowTextAdjusted.java:38)
>       at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:795)
>       at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:462)
>       at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:438)
>       at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:149)
>       at org.apache.pdfbox.text.PDFTextStreamEngine.processPage(PDFTextStreamEngine.java:117)
>       at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:369)
>       at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:305)
>       at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:249)
>       at org.apache.pdfbox.text.PDFTextStripper.getText(PDFTextStripper.java:210)
>       ... 8 more
> {code}
> The exception is thrown because the variable "unicode" contains two diacritic 
> signs (for example Arabic shaddah U+0651 and Arabic fathah U+064E, unicode 
> length = 2), while the array "widths" contains only one entry at that time 
> ([x.xxxxxxxx]).
> A temporary workaround could be to check the size of the array. This does not 
> address the actual problem, namely that the unicode and widths variables drift 
> apart:
> {code}
>     /**
>      * Merge a single character TextPosition into the current object. This is to be used only for
>      * cases where we have a diacritic that overlaps an existing TextPosition. In a graphical
>      * display, we could overlay them, but for text extraction we need to merge them. Use the
>      * contains() method to test if two objects overlap.
>      *
>      * @param diacritic TextPosition to merge into the current TextPosition.
>      */
>     public void mergeDiacritic(TextPosition diacritic)
>     {
>         if (diacritic.getUnicode().length() > 1)
>         {
>             return;
>         }
>         float diacXStart = diacritic.getXDirAdj();
>         float diacXEnd = diacXStart + diacritic.widths[0];
>         float currCharXStart = getXDirAdj();
>         int strLen = unicode.length();
>         boolean wasAdded = false;
>         for (int i = 0; i < strLen && !wasAdded; i++)
>         {
>             if (i <= (widths.length - 1))
>             {
>                 float currCharXEnd = currCharXStart + widths[i];
>                 // this is the case where there is an overlap of the diacritic character with the
>                 // current character and the previous character. If no previous character, just append
>                 // the diacritic after the current one
>                 if (diacXStart < currCharXStart && diacXEnd <= currCharXEnd)
>                 {
>                     if (i == 0)
>                     {
>                         insertDiacritic(i, diacritic);
>                     }
>                     else
>                     {
>                         float distanceOverlapping1 = diacXEnd - currCharXStart;
>                         float percentage1 = distanceOverlapping1/widths[i];
>                         float distanceOverlapping2 = currCharXStart - diacXStart;
>                         float percentage2 = distanceOverlapping2/widths[i - 1];
>                         if (percentage1 >= percentage2)
>                         {
>                             insertDiacritic(i, diacritic);
>                         }
>                         else
>                         {
>                             insertDiacritic(i - 1, diacritic);
>                         }
>                     }
>                     wasAdded = true;
>                 }
>                 // diacritic completely covers this character and therefore we assume that this is the
>                 // character the diacritic belongs to
>                 else if (diacXStart < currCharXStart && diacXEnd > currCharXEnd)
>                 {
>                     insertDiacritic(i, diacritic);
>                     wasAdded = true;
>                 }
>                 // otherwise, The diacritic modifies this character because its completely
>                 // contained by the character width
>                 else if (diacXStart >= currCharXStart && diacXEnd <= currCharXEnd)
>                 {
>                     insertDiacritic(i, diacritic);
>                     wasAdded = true;
>                 }
>                 // last character in the TextPosition so we add diacritic to the end
>                 else if (diacXStart >= currCharXStart && diacXEnd > currCharXEnd && i == strLen - 1)
>                 {
>                     insertDiacritic(i, diacritic);
>                     wasAdded = true;
>                 }
>                 // couldn't find anything useful so we go to the next character in the TextPosition
>                 currCharXStart += widths[i];
>             } else {
>                 // problem: unicode length and widths size differ
>             }
>         }
>     }
> {code}
> This problem has only occurred with Arabic text so far. Since there is no 
> evidence that it is limited to Arabic text, I did not attach it to another 
> issue. Further investigation is needed.


