TextPosition should expose its bounding box
-------------------------------------------
Key: PDFBOX-577
URL: https://issues.apache.org/jira/browse/PDFBOX-577
Project: PDFBox
Issue Type: Improvement
Reporter: Villu Ruusmann
It does not seem to be possible to calculate the bounding box of a TextPosition.
IIUC, TextPosition#getY is the baseline of the text and TextPosition#getHeight
is the absolute height of the text. When I subtract the latter from the former
I get a top line, but this is only correct if the text does not contain
descender characters.
Below is a screenshot (AFM-getHeight.png) which shows the bounding boxes of
TextPositions calculated as {#getX(), #getY() - #getHeight(), #getWidth(),
#getHeight()} painted in random colors. For example, the bounding boxes of
parentheses are severely misplaced, which makes line-by-line text
extraction impossible.
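To make the misplacement concrete, here is a minimal, self-contained sketch.
It does not use PDFBox; the class GlyphBoxSketch, its methods and the glyph
numbers are hypothetical and chosen only for illustration. It assumes that y
grows downwards (so that baseline minus height is a top edge, as in the
calculation above) and that the glyph's vertical extent is given relative to
the baseline, with a negative lower value for descenders.

```java
// Hypothetical sketch, not PDFBox API: compares the calculation from this
// report with one that uses the glyph's extent above and below the baseline.
class GlyphBoxSketch {

    /** Rectangle in page space; y is the top edge and grows downwards (assumption). */
    static final class Rect {
        final double x, y, width, height;

        Rect(double x, double y, double width, double height) {
            this.x = x;
            this.y = y;
            this.width = width;
            this.height = height;
        }

        double bottom() {
            return y + height;
        }
    }

    /** The reported calculation: treats the baseline as the bottom edge of the box. */
    static Rect naiveBox(double x, double baselineY, double width, double height) {
        return new Rect(x, baselineY - height, width, height);
    }

    /**
     * Uses the glyph's own vertical extent, measured upwards from the baseline.
     * lowerLeftY is negative for glyphs that descend below the baseline.
     */
    static Rect metricBox(double x, double baselineY, double width,
                          double lowerLeftY, double upperRightY) {
        return new Rect(x, baselineY - upperRightY, width, upperRightY - lowerLeftY);
    }

    public static void main(String[] args) {
        // Illustrative parenthesis: 7.5 units above and 2.0 units below the baseline.
        double baseline = 100.0;
        double lowerLeftY = -2.0;
        double upperRightY = 7.5;
        double height = upperRightY - lowerLeftY;

        Rect naive = naiveBox(50.0, baseline, 3.0, height);
        Rect metric = metricBox(50.0, baseline, 3.0, lowerLeftY, upperRightY);

        System.out.println("naive  top=" + naive.y + " bottom=" + naive.bottom());
        System.out.println("metric top=" + metric.y + " bottom=" + metric.bottom());
    }
}
```

In this sketch the naive box has the right size but sits too high by the depth
of the descender, which matches what the screenshot shows for parentheses.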
Right now I've solved the problem by tweaking the AFM FontMetrics code so that
it returns BoundingBox#getUpperRightY instead of BoundingBox#getHeight when
queried via PDSimpleFont#getFontHeight(byte[], int, int). Another screenshot
(AFM-getUpperRightY.png) shows how this restores the previously broken
line-by-line text extraction.
It seems like a good idea to rework TextPosition so that it is aware of
its bounding box:
*) Replace methods PDSimpleFont#getFontWidth(byte[], int, int) and
PDSimpleFont#getFontHeight(byte[], int, int) with a single method
PDSimpleFont#getFontBoundingBox(byte[], int, int)
*) Replace the constructor TextPosition(Matrix, Matrix) with
TextPosition(Matrix, BoundingBox)
*) Add new methods TextPosition#getBoundingBox, TextPosition#getBoundingBoxDir.
This shouldn't affect existing application clients, because TextPosition#getY
and TextPosition#getHeight remain in place.
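As a rough illustration of the first point, the sketch below shows how a
single bounding-box method could cover both width and height for a run of
character codes. Everything in it is hypothetical: RunBoundingBoxSketch, Glyph,
Box and runBoundingBox are stand-ins and not existing PDFBox classes or
methods, the glyph metrics are made up, and single-byte character codes are
assumed.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not PDFBox API: one box per run instead of a
// separate width query and height query.
class RunBoundingBoxSketch {

    /** Per-glyph metrics in glyph space; y values are relative to the baseline. */
    static final class Glyph {
        final double advance;
        final double lowerLeftY;
        final double upperRightY;

        Glyph(double advance, double lowerLeftY, double upperRightY) {
            this.advance = advance;
            this.lowerLeftY = lowerLeftY;
            this.upperRightY = upperRightY;
        }
    }

    /** Minimal stand-in for a bounding box type. */
    static final class Box {
        final double lowerLeftX, lowerLeftY, upperRightX, upperRightY;

        Box(double lowerLeftX, double lowerLeftY, double upperRightX, double upperRightY) {
            this.lowerLeftX = lowerLeftX;
            this.lowerLeftY = lowerLeftY;
            this.upperRightX = upperRightX;
            this.upperRightY = upperRightY;
        }

        double getWidth() {
            return upperRightX - lowerLeftX;
        }

        double getHeight() {
            return upperRightY - lowerLeftY;
        }
    }

    /**
     * Sums the advances and takes the union of the vertical extents of the
     * glyphs in codes[offset .. offset + length). Unknown codes are skipped.
     * The box always contains the baseline, because both extremes start at 0.
     */
    static Box runBoundingBox(Map<Integer, Glyph> metrics, byte[] codes, int offset, int length) {
        double width = 0.0;
        double minY = 0.0;
        double maxY = 0.0;
        for (int i = offset; i < offset + length; i++) {
            Glyph glyph = metrics.get(codes[i] & 0xFF);
            if (glyph == null) {
                continue;
            }
            width += glyph.advance;
            minY = Math.min(minY, glyph.lowerLeftY);
            maxY = Math.max(maxY, glyph.upperRightY);
        }
        return new Box(0.0, minY, width, maxY);
    }

    public static void main(String[] args) {
        Map<Integer, Glyph> metrics = new HashMap<>();
        metrics.put((int) '(', new Glyph(300, -200, 700)); // illustrative values
        metrics.put((int) 'a', new Glyph(500, 0, 500));    // illustrative values

        byte[] codes = {'(', 'a'};
        Box box = runBoundingBox(metrics, codes, 0, codes.length);
        System.out.println("width=" + box.getWidth() + " height=" + box.getHeight()
                + " descent=" + box.lowerLeftY + " ascent=" + box.upperRightY);
    }
}
```

With a box like this, a caller can still derive a width and an absolute height,
but it also learns how much of that height lies below the baseline, which is
the piece of information that is missing today.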