[ https://issues.apache.org/jira/browse/PDFBOX-2831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14593551#comment-14593551 ]
ASF subversion and git services commented on PDFBOX-2831:
---------------------------------------------------------

Commit 1686438 from [~tilman] in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1686438 ]

PDFBOX-2831: avoid ArrayIndexOutOfBoundsException if diacritic on ligature

> ArrayIndexOutOfBoundsException in mergeDiacritic() on extraction of text with diacritic text
> --------------------------------------------------------------------------------------------
>
>                 Key: PDFBOX-2831
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2831
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.8.9, 1.8.10, 2.0.0
>            Reporter: Andreas Meier
>            Priority: Minor
>         Attachments: PDFBOX-2831.pdf, chya31marked.jpg
>
>
> PDFBox may fail on extraction of text in method mergeDiacritic(TextPosition diacritic):
> {code}
> Caused by: java.lang.ArrayIndexOutOfBoundsException: 1
>     at org.apache.pdfbox.text.TextPosition.mergeDiacritic(TextPosition.java:532)
>     at org.apache.pdfbox.text.PDFTextStripper.processTextPosition(PDFTextStripper.java:945)
>     at org.apache.pdfbox.text.PDFTextStreamEngine.showGlyph(PDFTextStreamEngine.java:229)
>     at org.apache.pdfbox.contentstream.PDFStreamEngine.showText(PDFStreamEngine.java:683)
>     at org.apache.pdfbox.contentstream.PDFStreamEngine.showTextStrings(PDFStreamEngine.java:593)
>     at org.apache.pdfbox.contentstream.operator.text.ShowTextAdjusted.process(ShowTextAdjusted.java:38)
>     at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:795)
>     at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:462)
>     at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:438)
>     at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:149)
>     at org.apache.pdfbox.text.PDFTextStreamEngine.processPage(PDFTextStreamEngine.java:117)
>     at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:369)
>     at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:305)
>     at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:249)
>     at org.apache.pdfbox.text.PDFTextStripper.getText(PDFTextStripper.java:210)
>     ... 8 more
> {code}
> The exception is thrown because the variable "unicode" contains two diacritic
> signs (for example Arabic shaddah U+0651 and Arabic fathah U+064E, unicode
> length = 2), while the array "widths" contains only one entry at that time
> ([x.xxxxxxxx]).
> A temporary workaround could be to check the size of the array (this does not
> address the actual problem, which is that the unicode and widths variables
> drift apart):
> {code}
>     /**
>      * Merge a single character TextPosition into the current object. This is to be used only for
>      * cases where we have a diacritic that overlaps an existing TextPosition. In a graphical
>      * display, we could overlay them, but for text extraction we need to merge them. Use the
>      * contains() method to test if two objects overlap.
>      *
>      * @param diacritic TextPosition to merge into the current TextPosition.
>      */
>     public void mergeDiacritic(TextPosition diacritic)
>     {
>         if (diacritic.getUnicode().length() > 1)
>         {
>             return;
>         }
>         float diacXStart = diacritic.getXDirAdj();
>         float diacXEnd = diacXStart + diacritic.widths[0];
>         float currCharXStart = getXDirAdj();
>         int strLen = unicode.length();
>         boolean wasAdded = false;
>         for (int i = 0; i < strLen && !wasAdded; i++)
>         {
>             if (i <= (widths.length - 1))
>             {
>                 float currCharXEnd = currCharXStart + widths[i];
>                 // this is the case where there is an overlap of the diacritic character with the
>                 // current character and the previous character. If no previous character, just append
>                 // the diacritic after the current one
>                 if (diacXStart < currCharXStart && diacXEnd <= currCharXEnd)
>                 {
>                     if (i == 0)
>                     {
>                         insertDiacritic(i, diacritic);
>                     }
>                     else
>                     {
>                         float distanceOverlapping1 = diacXEnd - currCharXStart;
>                         float percentage1 = distanceOverlapping1/widths[i];
>                         float distanceOverlapping2 = currCharXStart - diacXStart;
>                         float percentage2 = distanceOverlapping2/widths[i - 1];
>                         if (percentage1 >= percentage2)
>                         {
>                             insertDiacritic(i, diacritic);
>                         }
>                         else
>                         {
>                             insertDiacritic(i - 1, diacritic);
>                         }
>                     }
>                     wasAdded = true;
>                 }
>                 // diacritic completely covers this character and therefore we assume that this is the
>                 // character the diacritic belongs to
>                 else if (diacXStart < currCharXStart && diacXEnd > currCharXEnd)
>                 {
>                     insertDiacritic(i, diacritic);
>                     wasAdded = true;
>                 }
>                 // otherwise, the diacritic modifies this character because it's completely
>                 // contained by the character width
>                 else if (diacXStart >= currCharXStart && diacXEnd <= currCharXEnd)
>                 {
>                     insertDiacritic(i, diacritic);
>                     wasAdded = true;
>                 }
>                 // last character in the TextPosition so we add diacritic to the end
>                 else if (diacXStart >= currCharXStart && diacXEnd > currCharXEnd && i == strLen - 1)
>                 {
>                     insertDiacritic(i, diacritic);
>                     wasAdded = true;
>                 }
>                 // couldn't find anything useful so we go to the next character in the TextPosition
>                 currCharXStart += widths[i];
>             } else {
>                 // problem: unicode length and widths size differ
>             }
>         }
>     }
> {code}
> So far this problem has only occurred with Arabic text. Since there is no
> evidence that it is limited to Arabic text, I did not attach it to another
> issue. Further investigation is needed.
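For readers who want to see the mismatch in isolation, here is a minimal, self-contained sketch of the condition described in the report: a two-character "unicode" string paired with a one-entry "widths" array. It does not use PDFBox and is not the change made in commit 1686438; the class WidthsDriftDemo, its helpers missingWidths() and charEnd(), and the width value 4.5 are hypothetical and exist only for illustration.

```java
// Hypothetical stand-alone demo; none of these names come from PDFBox.
final class WidthsDriftDemo {

    // Number of characters in `unicode` that have no matching entry in `widths`.
    static int missingWidths(String unicode, float[] widths) {
        return Math.max(0, unicode.length() - widths.length);
    }

    // X-coordinate where character `index` ends, or NaN when no width is
    // recorded for that index (instead of throwing
    // ArrayIndexOutOfBoundsException).
    static float charEnd(float start, float[] widths, int index) {
        if (index < 0 || index >= widths.length) {
            return Float.NaN;
        }
        float x = start;
        for (int i = 0; i <= index; i++) {
            x += widths[i];
        }
        return x;
    }

    public static void main(String[] args) {
        // Arabic shaddah (U+0651) followed by Arabic fathah (U+064E).
        String unicode = "\u0651\u064E";
        // Only one width is recorded; 4.5 is an arbitrary placeholder value.
        float[] widths = {4.5f};

        System.out.println("characters without a width: "
                + missingWidths(unicode, widths));
        for (int i = 0; i < unicode.length(); i++) {
            // An unguarded widths[i] would throw at i == 1.
            System.out.println("char " + i + " ends at " + charEnd(10f, widths, i));
        }
    }
}
```

Like the workaround quoted above, the bounds check only hides the symptom. Whatever appends a character to "unicode" would also need to append a width so that the two stay the same length.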