[ https://issues.apache.org/jira/browse/PDFBOX-4138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16384220#comment-16384220 ]
Julien Férard commented on PDFBOX-4138: --------------------------------------- I don't have any test file. I was just trying to understand the class (this is not easy!), with the idea to subclass it. The code is a bit complex, but I think the idea on beads is clear. h3. The context The charactersByArticle contains textLists (lists of text pos). It's size is 2 times the number of beads + 1, and it's content is: - textList before bead 0 - textList in bead 0 - ... - textList in bead n - textList after bead n (see [https://github.com/apache/pdfbox/blob/0e07344c0e3a932f0ca346f7cac4700882c67b5d/pdfbox/src/main/java/org/apache/pdfbox/text/PDFTextStripper.java#L151]) h3. The portion of code around the line 844 When examining a text pos (among other things), the portion of code tries to find to which textList it belongs. It does this (pseudo code): {code} i=-1 for bead in beads: if bead contains text pos: i = bead index * 2 + 1 break if i == -1: # bead not found for bead in beads: if text pos "before" bead: # there are 3 levels of before in the code i = bead index * 2 break if i == -1: # before bead not found i = charactersByArticle.get(size-1) # after last bead # use charactersByArticle.get(i) {code} (Actually, this is done in one loop in the code.) The question is: what is "before"? The answer in the code (*excluding the suspicious ||*) is: there are 3 levels of "before": "on the left and above" a bead is the best, but if no bead suits, then "on the left" is ok, and finally "above" is the worst. It doesn't make sense to have "on the left OR above" better than "on the left". h3. Aside Another way to see that the code is wrong is the following: we enter the two blocks at the lines 849-857 only if {{notFoundButFirstLeftAndAboveArticleDivisionIndex != -1}}, because : 1. {{else if}} => we need to have {{(x >= rect.getLowerLeftX() && y >= rect.getUpperRightY()) || notFoundButFirstLeftAndAboveArticleDivisionIndex != -1)}} 2. the tests check either {{x < rect.getLowerLeftX()}} or {{y < rect.getUpperRightY()}}, which are both false if {{x >= rect.getLowerLeftX() && y >= rect.getUpperRightY()}}. Thus, {{notFoundButFirstLeftArticleDivisionIndex}} and {{notFoundButFirstAboveArticleDivisionIndex}} may be != -1 only if {{notFoundButFirstLeftAndAboveArticleDivisionIndex}} is already != -1. And conversely, if {{notFoundButFirstLeftAndAboveArticleDivisionIndex == -1}}, then {{notFoundButFirstLeftArticleDivisionIndex == -1}} and {{notFoundButFirstAboveArticleDivisionIndex == -1}}. Why those tests at lines 879-886? We already know that {{notFoundButFirstLeftArticleDivisionIndex}} and {{notFoundButFirstAboveArticleDivisionIndex == -1}} because {{notFoundButFirstLeftAndAboveArticleDivisionIndex == -1}}! A part of the code is unreachable... This doesn't mean that the result will be better if the code is more consistent. I don't know... > PDFTextStripper: error in a comparison > -------------------------------------- > > Key: PDFBOX-4138 > URL: https://issues.apache.org/jira/browse/PDFBOX-4138 > Project: PDFBox > Issue Type: Bug > Components: Text extraction > Affects Versions: 2.0.8 > Reporter: Julien Férard > Priority: Minor > > This is very simple. Maybe I'm wrong, but in PdfTextStripper, l. 844 > > [https://github.com/apache/pdfbox/blob/0e07344c0e3a932f0ca346f7cac4700882c67b5d/pdfbox/src/main/java/org/apache/pdfbox/text/PDFTextStripper.java#L844] > * You want to check if the pos is on the left *and* above the rectangle > (this is better than just on the left or just above); > * The name of the variable contains "LeftAndAbove". > ...and the code contains a `||` (or). -- This message was sent by Atlassian JIRA (v7.6.3#76005) --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org For additional commands, e-mail: dev-h...@pdfbox.apache.org