TextPosition should expose its bounding box
-------------------------------------------

                 Key: PDFBOX-577
                 URL: https://issues.apache.org/jira/browse/PDFBOX-577
             Project: PDFBox
          Issue Type: Improvement
            Reporter: Villu Ruusmann


It does not seem to be possible to calculate the bounding box of a TextPosition.

IIUC, TextPosition#getY is the baseline of the text and TextPosition#getHeight 
is the absolute height of the text. Subtracting the latter from the former 
gives a top line, but this is only correct if the text contains no descender 
characters, presumably because the height then also covers the part of the 
glyph below the baseline.

Below is a screenshot (AFM-getHeight.png) which shows the bounding boxes of 
TextPositions calculated as {#getX(), #getY() - #getHeight(), #getWidth(), 
#getHeight()} and painted in random colors. For example, the bounding boxes 
of parentheses are severely misplaced, which makes line-by-line text 
extraction impossible.
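
To make the mismatch concrete, here is a minimal Java sketch. It is not PDFBox 
code: the class and method names are hypothetical, and the numbers are made up 
for illustration. It assumes page coordinates with y growing downward (as 
implied by #getY() - #getHeight() being a top line) and glyph extents given 
relative to the baseline, where lowerLeftY is negative for descenders.

```java
// Hypothetical helper, not part of PDFBox. Boxes are {left, top, width, height}
// in page units, with y growing downward (an assumption of this sketch).
final class BoxSketch {

    // The calculation from the screenshot: treats the baseline as the bottom edge.
    static double[] naiveBox(double x, double baselineY, double width, double height) {
        return new double[] { x, baselineY - height, width, height };
    }

    // Uses the glyph extents relative to the baseline instead:
    // upperRightY is the ascent, lowerLeftY is negative for descenders.
    static double[] extentBox(double x, double baselineY, double width,
                              double lowerLeftY, double upperRightY) {
        return new double[] { x, baselineY - upperRightY, width, upperRightY - lowerLeftY };
    }
}
```

With a baseline at y = 100 and a parenthesis-like glyph reaching 7.5 units 
above and 2 units below it, naiveBox(10, 100, 3, 9.5) puts the top edge at 
90.5, while extentBox(10, 100, 3, -2.0, 7.5) puts it at 92.5. Both boxes have 
the same height, so the naive one sits too high by the depth of the descender.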

For now I have worked around the problem by tweaking the AFM FontMetrics code 
so that it returns BoundingBox#getUpperRightY instead of BoundingBox#getHeight 
when queried via PDSimpleFont#getFontHeight(byte[], int, int). Another 
screenshot (AFM-getUpperRightY.png) shows how this restores line-by-line text 
extraction.

It seems like a good idea to rework TextPosition so that it is aware of its 
bounding box:
*) Replace the methods PDSimpleFont#getFontWidth(byte[], int, int) and 
PDSimpleFont#getFontHeight(byte[], int, int) with a single method 
PDSimpleFont#getFontBoundingBox(byte[], int, int).
*) Replace the constructor TextPosition(Matrix, Matrix) with 
TextPosition(Matrix, BoundingBox).
*) Add new methods TextPosition#getBoundingBox and 
TextPosition#getBoundingBoxDir.

This shouldn't affect existing application clients, because TextPosition#getY 
and TextPosition#getHeight remain in place.
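
A rough sketch of what such a bounding-box-aware text position could look 
like, under stated assumptions. GlyphBox and BoxedTextPosition are 
hypothetical stand-ins for BoundingBox and TextPosition, not the real PDFBox 
classes. The constructor takes plain coordinates instead of a Matrix to keep 
the example self-contained, y is assumed to grow downward on the page, and 
#getBoundingBoxDir is left out because its exact semantics are not specified 
here.

```java
// Hypothetical stand-in for BoundingBox: glyph extents relative to the text
// origin, in page units, with y growing upward as in font metrics.
final class GlyphBox {
    final double lowerLeftX, lowerLeftY, upperRightX, upperRightY;

    GlyphBox(double lowerLeftX, double lowerLeftY, double upperRightX, double upperRightY) {
        this.lowerLeftX = lowerLeftX;
        this.lowerLeftY = lowerLeftY;
        this.upperRightX = upperRightX;
        this.upperRightY = upperRightY;
    }

    double getWidth()  { return upperRightX - lowerLeftX; }
    double getHeight() { return upperRightY - lowerLeftY; }
}

// Hypothetical stand-in for TextPosition that keeps the old accessors and
// adds a bounding box in page space.
final class BoxedTextPosition {
    private final double x;          // text origin, page space
    private final double baselineY;  // baseline, page space (y grows downward)
    private final GlyphBox glyph;

    BoxedTextPosition(double x, double baselineY, GlyphBox glyph) {
        this.x = x;
        this.baselineY = baselineY;
        this.glyph = glyph;
    }

    // Existing accessors keep their meaning, so current callers are unaffected.
    double getX()      { return x; }
    double getY()      { return baselineY; }
    double getWidth()  { return glyph.getWidth(); }
    double getHeight() { return glyph.getHeight(); }

    // New accessor: {left, top, width, height} in page space.
    double[] getBoundingBox() {
        return new double[] {
            x + glyph.lowerLeftX,
            baselineY - glyph.upperRightY,
            glyph.getWidth(),
            glyph.getHeight()
        };
    }
}
```

In this sketch a caller that only needs the baseline keeps using getY(), 
while line-grouping code can read the top edge from getBoundingBox() without 
guessing how much of the height lies below the baseline.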

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
