[ 
https://issues.apache.org/jira/browse/PDFBOX-4138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16384220#comment-16384220
 ] 

Julien Férard commented on PDFBOX-4138:
---------------------------------------

I don't have any test file. I was just trying to understand the class (this is 
not easy!), with the idea to subclass it. The code is a bit complex, but I 
think the idea on beads is clear.
h3. The context

The charactersByArticle contains textLists (lists of text pos). It's size is 2 
times the number of beads + 1, and it's content is:
 - textList before bead 0
 - textList in bead 0
 - ...
 - textList in bead n
 - textList after bead n

(see 
[https://github.com/apache/pdfbox/blob/0e07344c0e3a932f0ca346f7cac4700882c67b5d/pdfbox/src/main/java/org/apache/pdfbox/text/PDFTextStripper.java#L151])
h3. The portion of code around the line 844

When examining a text pos (among other things), the portion of code tries to 
find to which textList it belongs. It does this (pseudo code):

 
{code}
i=-1
for bead in beads:
    if bead contains text pos:
        i = bead index * 2 + 1
        break

if i == -1: # bead not found
    for bead in beads:
        if text pos "before" bead: # there are 3 levels of before in the code
            i = bead index * 2
            break

if i == -1: # before bead not found
    i = charactersByArticle.get(size-1) # after last bead

# use charactersByArticle.get(i)
{code}
 

(Actually, this is done in one loop in the code.)

The question is: what is "before"? The answer in the code (*excluding the 
suspicious ||*) is: there are 3 levels of "before": "on the left and above" a 
bead is the best, but if no bead suits, then "on the left" is ok, and finally 
"above" is the worst. It doesn't make sense to have "on the left OR above" 
better than "on the left".
h3. Aside

Another way to see that the code is wrong is the following: we enter the two 
blocks at the lines 849-857 only if 
{{notFoundButFirstLeftAndAboveArticleDivisionIndex != -1}}, because : 1. {{else 
if}} => we need to have {{(x >= rect.getLowerLeftX() && y >= 
rect.getUpperRightY()) || notFoundButFirstLeftAndAboveArticleDivisionIndex != 
-1)}} 2. the tests check either {{x < rect.getLowerLeftX()}} or {{y < 
rect.getUpperRightY()}}, which are both false if {{x >= rect.getLowerLeftX() && 
y >= rect.getUpperRightY()}}.
 Thus, {{notFoundButFirstLeftArticleDivisionIndex}} and 
{{notFoundButFirstAboveArticleDivisionIndex}} may be != -1 only if 
{{notFoundButFirstLeftAndAboveArticleDivisionIndex}} is already != -1. And 
conversely, if {{notFoundButFirstLeftAndAboveArticleDivisionIndex == -1}}, then 
{{notFoundButFirstLeftArticleDivisionIndex == -1}} and 
{{notFoundButFirstAboveArticleDivisionIndex == -1}}.
 Why those tests at lines 879-886? We already know that 
{{notFoundButFirstLeftArticleDivisionIndex}} and 
{{notFoundButFirstAboveArticleDivisionIndex == -1}} because 
{{notFoundButFirstLeftAndAboveArticleDivisionIndex == -1}}! A part of the code 
is unreachable...

This doesn't mean that the result will be better if the code is more 
consistent. I don't know...

> PDFTextStripper: error in a comparison
> --------------------------------------
>
>                 Key: PDFBOX-4138
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4138
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.8
>            Reporter: Julien Férard
>            Priority: Minor
>
> This is very simple. Maybe I'm wrong, but in PdfTextStripper, l. 844
>  
> [https://github.com/apache/pdfbox/blob/0e07344c0e3a932f0ca346f7cac4700882c67b5d/pdfbox/src/main/java/org/apache/pdfbox/text/PDFTextStripper.java#L844]
>  * You want to check if the pos is on the left *and* above the rectangle 
> (this is better than just on the left or just above);
>  * The name of the variable contains "LeftAndAbove".
> ...and the code contains a `||` (or).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org

Reply via email to