[ 
https://issues.apache.org/jira/browse/PDFBOX-4138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16384683#comment-16384683
 ] 

Julien Férard commented on PDFBOX-4138:
---------------------------------------

I played around with the document 003422.pdf (see above) and the different 
ways of placing text that lies outside the beads between the beads. The 
main difficulty is that there is no natural way to guess whether a piece of 
text comes before or after a specific bead. When the result was correct, it 
was more a lucky accident than an effect of the quality of the algorithm used.

Let's assume that the writing runs from left to right and top to bottom (as in 
English). The page can be structured in columns:


{noformat}
+-------------+
|+----+ +----+|
||    | |    ||
|| 1  | | 2  ||
||    | |    ||
|+----+ +----+|
+-------------+{noformat}

or in horizontal frames:


{noformat}
+-------------+
|+-----------+|
||    1      ||
|+-----------+|
|+-----------+|
||    2      ||
|+-----------+|
+-------------+{noformat}

or a mix of both like in 003422.pdf:


{noformat}
+-------------+
|+-----------+|
||     1     ||
|+-----------+|
|+----+ +----+|
||    | |    ||
|| 2  | | 3  ||
||    | |    ||
|+----+ +----+|
+-------------+{noformat}
For a piece of text outside the beads, the current way to guess whether it 
comes before or after a specific bead seems far too simple (aside from the || 
issue). Here's another idea: use consecutive beads to determine a kind of 
direction. Take consecutive beads in pairs and compute a kind of midpoint for 
each pair. Given a text position, find the closest midpoint: the text then 
goes between the two beads of that pair. An exception should be made if the 
text is above all beads OR to the left of all beads: this is the first text. 
Another exception applies if the text is below OR to the right of all beads: 
this is the last text.

For this structure,


{noformat}
+------a------+
|+-----------+|
||     1     ||
|+-----------+|
|      b      |  
|+----+ +----+|
||    | |    ||
|| 2  |c| 3  ||
||    | |    ||
|+----+ +----+|
+------d------+{noformat}


it would give the sequence: a - 1 - b - 2 - c - 3 - d. There are probably 
better solutions, but I think that the class should be seriously refactored 
before trying to implement something like that.
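
To make the idea concrete, here is a minimal sketch in plain Java. It is not 
PDFBox code and not a proposed patch: the names BeadOrderSketch, slotFor and 
readingOrder are hypothetical, beads are modelled as java.awt.geom.Rectangle2D 
instead of PDFBox rectangles, y is assumed to grow downward, and the "middle" 
of two beads is assumed to be the midpoint between their centers. The 
coordinates are invented to mimic the diagram above.

```java
import java.awt.geom.Point2D;
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class BeadOrderSketch {

    // Returns the slot for a text position: 0 = before the first bead,
    // i = between bead i and bead i + 1 (1-based), beads.size() = after the last bead.
    static int slotFor(Point2D text, List<Rectangle2D> beads) {
        double minX = Double.POSITIVE_INFINITY, minY = Double.POSITIVE_INFINITY;
        double maxX = Double.NEGATIVE_INFINITY, maxY = Double.NEGATIVE_INFINITY;
        for (Rectangle2D b : beads) {
            minX = Math.min(minX, b.getMinX());
            minY = Math.min(minY, b.getMinY());
            maxX = Math.max(maxX, b.getMaxX());
            maxY = Math.max(maxY, b.getMaxY());
        }
        // Exception 1: above all beads OR left of all beads -> first text.
        if (text.getY() < minY || text.getX() < minX) {
            return 0;
        }
        // Exception 2: below all beads OR right of all beads -> last text.
        if (text.getY() > maxY || text.getX() > maxX) {
            return beads.size();
        }
        // General case: pick the pair of consecutive beads whose midpoint is closest.
        int best = beads.size(); // fallback when there is no pair of beads
        double bestDist = Double.POSITIVE_INFINITY;
        for (int i = 1; i < beads.size(); i++) {
            Rectangle2D prev = beads.get(i - 1);
            Rectangle2D next = beads.get(i);
            double midX = (prev.getCenterX() + next.getCenterX()) / 2;
            double midY = (prev.getCenterY() + next.getCenterY()) / 2;
            double dist = text.distance(midX, midY);
            if (dist < bestDist) {
                bestDist = dist;
                best = i;
            }
        }
        return best;
    }

    // Interleaves bead numbers (1-based) with the labels of the texts assigned to each slot.
    static String readingOrder(Map<String, Point2D> texts, List<Rectangle2D> beads) {
        List<List<String>> slots = new ArrayList<>();
        for (int i = 0; i <= beads.size(); i++) {
            slots.add(new ArrayList<>());
        }
        for (Map.Entry<String, Point2D> e : texts.entrySet()) {
            slots.get(slotFor(e.getValue(), beads)).add(e.getKey());
        }
        List<String> out = new ArrayList<>(slots.get(0));
        for (int i = 1; i <= beads.size(); i++) {
            out.add(String.valueOf(i));
            out.addAll(slots.get(i));
        }
        return String.join(" - ", out);
    }

    // Invented layout resembling the last diagram: one wide bead above two columns.
    static String example() {
        List<Rectangle2D> beads = List.of(
                new Rectangle2D.Double(10, 10, 80, 20),  // bead 1
                new Rectangle2D.Double(10, 50, 35, 40),  // bead 2
                new Rectangle2D.Double(55, 50, 35, 40)); // bead 3
        Map<String, Point2D> texts = new LinkedHashMap<>();
        texts.put("d", new Point2D.Double(50, 95)); // below all beads
        texts.put("c", new Point2D.Double(50, 70)); // in the gutter between 2 and 3
        texts.put("b", new Point2D.Double(50, 40)); // between bead 1 and the columns
        texts.put("a", new Point2D.Double(50, 5));  // above all beads
        return readingOrder(texts, beads);
    }

    public static void main(String[] args) {
        System.out.println(example());
    }
}
```

With these made-up coordinates, main prints a - 1 - b - 2 - c - 3 - d. The 
sketch does not sort several texts that fall into the same slot, and it says 
nothing about how well the heuristic behaves on other layouts.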

 

> PDFTextStripper: error in a comparison
> --------------------------------------
>
>                 Key: PDFBOX-4138
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4138
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.8
>            Reporter: Julien Férard
>            Priority: Minor
>
> This is very simple. Maybe I'm wrong, but in PDFTextStripper, line 844
>  
> [https://github.com/apache/pdfbox/blob/0e07344c0e3a932f0ca346f7cac4700882c67b5d/pdfbox/src/main/java/org/apache/pdfbox/text/PDFTextStripper.java#L844]
>  * You want to check if the pos is on the left *and* above the rectangle 
> (this is better than just on the left or just above);
>  * The name of the variable contains "LeftAndAbove".
> ...and the code contains a `||` (or).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
