[ 
https://issues.apache.org/jira/browse/PDFBOX-2831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14592486#comment-14592486
 ] 

Tilman Hausherr commented on PDFBOX-2831:
-----------------------------------------

[~AndreasMeier] assuming that you know the language: if the "fatha" is 
missing, would the resulting words still make sense? I tried a solution 
(probably similar to yours), and the result is that the text is extracted 
without the "fatha" diacritic, so for the reduced file only ين appears 
instead of يَن . I suspect the reason is that the "fatha" should be 
assigned to the second glyph and that the code needs to be more complex. If we 
can't fix that, I suggest quitting the loop and writing a log message.
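
One way to check whether an extraction dropped a diacritic is to count the 
non-spacing marks in the extracted string. The following is only a minimal 
sketch in plain Java, not PDFBox code: the class and method names are 
hypothetical, and it assumes the two strings mentioned above are encoded as 
U+064A U+064E U+0646 (with the fatha) and U+064A U+0646 (without it).

```java
class DiacriticCheckSketch
{
    /** Counts the code points whose Unicode general category is Mn (non-spacing mark). */
    static int countNonSpacingMarks(String text)
    {
        int count = 0;
        for (int i = 0; i < text.length(); )
        {
            int codePoint = text.codePointAt(i);
            if (Character.getType(codePoint) == Character.NON_SPACING_MARK)
            {
                count++;
            }
            // advance by one or two chars, depending on the code point
            i += Character.charCount(codePoint);
        }
        return count;
    }

    public static void main(String[] args)
    {
        // assumed encoding of the two strings: ya + fatha + nun, and ya + nun
        String withFatha = "\u064A\u064E\u0646";
        String withoutFatha = "\u064A\u0646";
        System.out.println(countNonSpacingMarks(withFatha));    // prints 1
        System.out.println(countNonSpacingMarks(withoutFatha)); // prints 0
    }
}
```

Comparing this count for the expected and the extracted text would show 
whether a mark such as the fatha was lost, without requiring knowledge of the language.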

[~carrier], [~justinl] are you still around, and could you help? You 
worked on PDFBOX-444, which was about diacritics, in 2009.

> ArrayIndexOutOfBoundsException in mergeDiacritic() on extraction of text with 
> diacritic text
> --------------------------------------------------------------------------------------------
>
>                 Key: PDFBOX-2831
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2831
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.0
>            Reporter: Andreas Meier
>            Priority: Minor
>         Attachments: PDFBOX-2831.pdf, chya31marked.jpg
>
>
> PDFBox may fail on extraction of text in method mergeDiacritic(TextPosition 
> diacritic):
> {code}
> Caused by: java.lang.ArrayIndexOutOfBoundsException: 1
>       at org.apache.pdfbox.text.TextPosition.mergeDiacritic(TextPosition.java:532)
>       at org.apache.pdfbox.text.PDFTextStripper.processTextPosition(PDFTextStripper.java:945)
>       at org.apache.pdfbox.text.PDFTextStreamEngine.showGlyph(PDFTextStreamEngine.java:229)
>       at org.apache.pdfbox.contentstream.PDFStreamEngine.showText(PDFStreamEngine.java:683)
>       at org.apache.pdfbox.contentstream.PDFStreamEngine.showTextStrings(PDFStreamEngine.java:593)
>       at org.apache.pdfbox.contentstream.operator.text.ShowTextAdjusted.process(ShowTextAdjusted.java:38)
>       at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:795)
>       at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:462)
>       at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:438)
>       at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:149)
>       at org.apache.pdfbox.text.PDFTextStreamEngine.processPage(PDFTextStreamEngine.java:117)
>       at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:369)
>       at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:305)
>       at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:249)
>       at org.apache.pdfbox.text.PDFTextStripper.getText(PDFTextStripper.java:210)
>       ... 8 more
> {code}
> The exception is thrown because the variable "unicode" contains two diacritic 
> signs (for example, Arabic shaddah U+0651 and Arabic fathah U+064E, so the 
> unicode length is 2), while the array "widths" contains only one entry at that 
> time ( [x.xxxxxxxx] ).
> A temporary workaround could be to check the size of the array. This does not 
> address the actual problem, which is that the unicode and widths variables 
> drift apart:
> {code}
> /**
>      * Merge a single character TextPosition into the current object. This is to be used only for
>      * cases where we have a diacritic that overlaps an existing TextPosition. In a graphical
>      * display, we could overlay them, but for text extraction we need to merge them. Use the
>      * contains() method to test if two objects overlap.
>      *
>      * @param diacritic TextPosition to merge into the current TextPosition.
>      */
>     public void mergeDiacritic(TextPosition diacritic)
>     {
>         if (diacritic.getUnicode().length() > 1)
>         {
>             return;
>         }
>         float diacXStart = diacritic.getXDirAdj();
>         float diacXEnd = diacXStart + diacritic.widths[0];
>         float currCharXStart = getXDirAdj();
>         int strLen = unicode.length();
>         boolean wasAdded = false;
>         for (int i = 0; i < strLen && !wasAdded; i++)
>         {
>             // workaround: only access widths[i] if the index exists
>             if (i <= (widths.length - 1))
>             {
>                 float currCharXEnd = currCharXStart + widths[i];
>                 // this is the case where there is an overlap of the diacritic character with the
>                 // current character and the previous character. If no previous character, just
>                 // append the diacritic after the current one
>                 if (diacXStart < currCharXStart && diacXEnd <= currCharXEnd)
>                 {
>                     if (i == 0)
>                     {
>                         insertDiacritic(i, diacritic);
>                     }
>                     else
>                     {
>                         float distanceOverlapping1 = diacXEnd - currCharXStart;
>                         float percentage1 = distanceOverlapping1/widths[i];
>                         float distanceOverlapping2 = currCharXStart - diacXStart;
>                         float percentage2 = distanceOverlapping2/widths[i - 1];
>                         if (percentage1 >= percentage2)
>                         {
>                             insertDiacritic(i, diacritic);
>                         }
>                         else
>                         {
>                             insertDiacritic(i - 1, diacritic);
>                         }
>                     }
>                     wasAdded = true;
>                 }
>                 // diacritic completely covers this character and therefore we assume that this
>                 // is the character the diacritic belongs to
>                 else if (diacXStart < currCharXStart && diacXEnd > currCharXEnd)
>                 {
>                     insertDiacritic(i, diacritic);
>                     wasAdded = true;
>                 }
>                 // otherwise, the diacritic modifies this character because it is completely
>                 // contained by the character width
>                 else if (diacXStart >= currCharXStart && diacXEnd <= currCharXEnd)
>                 {
>                     insertDiacritic(i, diacritic);
>                     wasAdded = true;
>                 }
>                 // last character in the TextPosition so we add diacritic to the end
>                 else if (diacXStart >= currCharXStart && diacXEnd > currCharXEnd && i == strLen - 1)
>                 {
>                     insertDiacritic(i, diacritic);
>                     wasAdded = true;
>                 }
>                 // couldn't find anything useful so we go to the next character in the
>                 // TextPosition
>                 currCharXStart += widths[i];
>             }
>             else
>             {
>                 // problem: unicode length and widths size differ
>             }
>         }
>     }
> {code}
> So far this problem has only occurred with Arabic text. Since there is no 
> evidence that it is limited to Arabic text, I did not attach it to another 
> issue. Further investigation is needed.

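The empty else branch in the quoted workaround is where the loop could be quit 
with a log message, as suggested in the comment above. A minimal, standalone 
sketch of that idea follows. It is not the PDFBox implementation: the class and 
method names are hypothetical, the overlap tests are left out, and 
java.util.logging is used only to keep the example self-contained.

```java
import java.util.logging.Logger;

class DiacriticLoopSketch
{
    private static final Logger LOG = Logger.getLogger(DiacriticLoopSketch.class.getName());

    /**
     * Hypothetical stand-in for the loop in mergeDiacritic(). Returns the number of
     * characters that could be examined before "unicode" and "widths" drift apart.
     */
    static int examinedCharacters(String unicode, float[] widths)
    {
        int strLen = unicode.length();
        for (int i = 0; i < strLen; i++)
        {
            if (i >= widths.length)
            {
                // quit the loop and log instead of throwing ArrayIndexOutOfBoundsException
                LOG.warning("unicode length " + strLen + " exceeds widths length "
                        + widths.length + ", diacritic not merged");
                return i;
            }
            // the overlap tests that read widths[i] would go here
        }
        return strLen;
    }

    public static void main(String[] args)
    {
        // shaddah (U+0651) followed by fathah (U+064E): two chars but only one width,
        // as in the reported case; the width value itself is made up
        String unicode = "\u0651\u064E";
        float[] widths = { 5.0f };
        System.out.println(examinedCharacters(unicode, widths)); // prints 1
    }
}
```

Like the quoted workaround, this only avoids the exception; it does not address 
why the two variables drift apart.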

