[ https://issues.apache.org/jira/browse/PDFBOX-4138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16384683#comment-16384683 ]
Julien Férard commented on PDFBOX-4138:
---------------------------------------

I played around with the document 003422.pdf (see above) and the different ways of placing text that lies outside of the beads between the beads. The main difficulty is that there is no natural way to guess whether a piece of text comes before or after a specific bead. The correctness of the result was more a lucky accident than a consequence of the quality of the algorithm used.

Let's assume that the writing runs from left to right and top to bottom (as in English). The page can be structured in columns:
{noformat}
+-------------+
|+----+ +----+|
||    | |    ||
|| 1  | | 2  ||
||    | |    ||
|+----+ +----+|
+-------------+{noformat}
or in horizontal frames:
{noformat}
+-------------+
|+-----------+|
||     1     ||
|+-----------+|
|+-----------+|
||     2     ||
|+-----------+|
+-------------+{noformat}
or a mix of both, as in 003422.pdf:
{noformat}
+-------------+
|+-----------+|
||     1     ||
|+-----------+|
|+----+ +----+|
||    | |    ||
|| 2  | | 3  ||
||    | |    ||
|+----+ +----+|
+-------------+{noformat}
For a piece of text outside the beads, the current way of guessing whether it comes before or after a specific bead seems far too simple (aside from the || issue).

Here's another idea: use consecutive beads to determine a kind of direction. Take consecutive beads in pairs and compute a kind of midpoint for each pair. Given a text position, find the closest midpoint. An exception should be made if the text is above all beads OR to the left of all beads: this is the first text. Another exception applies if the text is below all beads OR to the right of all beads: this is the last text.

For this structure,
{noformat}
+------a------+
|+-----------+|
||     1     ||
|+-----------+|
|      b      |
|+----+ +----+|
||    | |    ||
|| 2  |c| 3  ||
||    | |    ||
|+----+ +----+|
+------d------+{noformat}
it would give the sequence: a - 1 - b - 2 - c - 3 - d.

There are probably better solutions, but I think that the class should be seriously refactored before trying to implement something like that.
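
To make the idea above a bit more concrete, here is a minimal, standalone sketch in Java. It is not PDFBox code: the class BeadOrderSketch and the method slotFor are hypothetical names, beads are modelled as java.awt.geom.Rectangle2D instead of PDRectangle, "a kind of middle" is interpreted as the midpoint between the centers of two consecutive beads, and the origin is assumed to be at the top left with y growing downwards. Text that lies inside a bead is assumed to be handled elsewhere.

```java
import java.awt.geom.Point2D;
import java.awt.geom.Rectangle2D;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of PDFBox.
final class BeadOrderSketch {

    /**
     * Returns the slot of a text position relative to the beads (given in reading order):
     * 0 means before the first bead, k means between bead k-1 and bead k,
     * beads.size() means after the last bead.
     * Assumption: origin at the top left, y grows downwards.
     */
    static int slotFor(List<Rectangle2D> beads, Point2D text) {
        if (beads.isEmpty()) {
            return 0;
        }
        // Bounding box of all beads, used for the two exceptions.
        double minX = Double.POSITIVE_INFINITY, minY = Double.POSITIVE_INFINITY;
        double maxX = Double.NEGATIVE_INFINITY, maxY = Double.NEGATIVE_INFINITY;
        for (Rectangle2D b : beads) {
            minX = Math.min(minX, b.getMinX());
            minY = Math.min(minY, b.getMinY());
            maxX = Math.max(maxX, b.getMaxX());
            maxY = Math.max(maxY, b.getMaxY());
        }
        // Above all beads OR to the left of all beads: this is the first text.
        if (text.getY() < minY || text.getX() < minX) {
            return 0;
        }
        // Below all beads OR to the right of all beads: this is the last text.
        if (text.getY() > maxY || text.getX() > maxX) {
            return beads.size();
        }
        // Otherwise pick the closest midpoint between two consecutive beads.
        int best = 1;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int i = 0; i + 1 < beads.size(); i++) {
            double mx = (beads.get(i).getCenterX() + beads.get(i + 1).getCenterX()) / 2;
            double my = (beads.get(i).getCenterY() + beads.get(i + 1).getCenterY()) / 2;
            double d = text.distanceSq(mx, my);
            if (d < bestDist) {
                bestDist = d;
                best = i + 1;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Made-up coordinates mimicking the mixed layout: bead 1 on top, beads 2 and 3 below.
        List<Rectangle2D> beads = Arrays.asList(
                new Rectangle2D.Double(10, 10, 110, 30),
                new Rectangle2D.Double(10, 50, 50, 50),
                new Rectangle2D.Double(70, 50, 50, 50));
        System.out.println("a -> " + slotFor(beads, new Point2D.Double(65, 2)));
        System.out.println("b -> " + slotFor(beads, new Point2D.Double(65, 45)));
        System.out.println("c -> " + slotFor(beads, new Point2D.Double(65, 75)));
        System.out.println("d -> " + slotFor(beads, new Point2D.Double(65, 105)));
    }
}
```

With these made-up coordinates, a, b, c and d land in slots 0, 1, 2 and 3, which corresponds to the sequence a - 1 - b - 2 - c - 3 - d. Text in the left margin next to bead 2 would also be classified as "first text" by the OR in the first exception, so the rule is only a heuristic.
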
> PDFTextStripper: error in a comparison
> --------------------------------------
>
>                 Key: PDFBOX-4138
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4138
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.8
>            Reporter: Julien Férard
>            Priority: Minor
>
> This is very simple. Maybe I'm wrong, but in PDFTextStripper, l. 844
> [https://github.com/apache/pdfbox/blob/0e07344c0e3a932f0ca346f7cac4700882c67b5d/pdfbox/src/main/java/org/apache/pdfbox/text/PDFTextStripper.java#L844]
> * You want to check if the pos is on the left *and* above the rectangle (this is better than just on the left or just above);
> * The name of the variable contains "LeftAndAbove".
> ...and the code contains a `||` (or).