[ https://issues.apache.org/jira/browse/PDFBOX-577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12883580#action_12883580 ]
Karl Ward commented on PDFBOX-577:
----------------------------------
I'm actually only interested in the bounding box that fully encloses all the
characters in a run of text. So I am using the Ascent and Descent values from
the font's font descriptor dictionary, together with the baseline position, to
calculate the maximum and minimum y for a particular run of text.
Attached is a patch that adds getAscent() and getDescent() to PDFont. These new
methods mimic those found in GfxFont in the xpdf project (which are in fact
used by the pdf2html tool to perform text extraction).
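
As a rough illustration of the calculation described above (not the attached
patch), here is a minimal sketch. It assumes Ascent and Descent are given in
glyph space units (1/1000 of the font size), that Descent is negative as the
PDF specification requires, that y increases upward, and that the text is
unrotated so the font size is the only vertical scale. The class and method
names (TextRunBounds, YRange, verticalBounds) are hypothetical and are not part
of PDFBox; the sample metrics are the Helvetica AFM ascender and descender.

```java
import java.util.Locale;

final class TextRunBounds {

    /** Vertical extent of a run of text, in the same units as the baseline. */
    static final class YRange {
        final float minY;
        final float maxY;

        YRange(float minY, float maxY) {
            this.minY = minY;
            this.maxY = maxY;
        }
    }

    /**
     * Computes the minimum and maximum y of a text run from its baseline.
     *
     * @param baselineY y position of the baseline (y assumed to increase upward)
     * @param fontSize  font size in the same units as baselineY
     * @param ascent    Ascent from the font descriptor, in 1/1000 units
     * @param descent   Descent from the font descriptor, in 1/1000 units
     */
    static YRange verticalBounds(float baselineY, float fontSize,
                                 float ascent, float descent) {
        // Defensive assumption: a positive Descent is treated as a sign error,
        // since the value is supposed to lie below the baseline.
        float normalizedDescent = -Math.abs(descent);

        float top = baselineY + ascent / 1000f * fontSize;
        float bottom = baselineY + normalizedDescent / 1000f * fontSize;
        return new YRange(bottom, top);
    }

    public static void main(String[] args) {
        // Helvetica AFM metrics: Ascender 718, Descender -207.
        YRange range = verticalBounds(100f, 12f, 718f, -207f);
        System.out.printf(Locale.ROOT, "minY=%.3f maxY=%.3f%n",
                range.minY, range.maxY);
    }
}
```

If the y coordinate in use increases downward instead, the same offsets apply
with their signs flipped: the ascent is subtracted from the baseline and the
descent magnitude is added to it.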
> TextPosition should expose its bounding box
> -------------------------------------------
>
> Key: PDFBOX-577
> URL: https://issues.apache.org/jira/browse/PDFBOX-577
> Project: PDFBox
> Issue Type: Improvement
> Reporter: Villu Ruusmann
> Attachments: AFM-getHeight.png, AFM-getUpperRightY.png
>
>
> It does not seem to be possible to calculate the bounding box of a
> TextPosition.
> IIUC, TextPosition#getY is the baseline of the text and
> TextPosition#getHeight is the absolute height of the text. When I subtract
> the latter from the former I get a top line, but this is only correct if the
> text does not contain descender characters.
> Below is a screenshot (AFM-getHeight.png) which shows the bounding boxes of
> TextPositions calculated as {#getX(), #getY() - #getHeight(), #getWidth(),
> #getHeight()} painted in random colors. For example, the bounding boxes of
> parentheses are severely misplaced, which makes line-by-line text
> extraction impossible.
> Right now I've worked around the problem by tweaking the AFM FontMetrics code
> so that it returns BoundingBox#getUpperRightY instead of BoundingBox#getHeight
> when queried via PDSimpleFont#getFontHeight(byte[], int, int). Another
> screenshot (AFM-getUpperRightY.png) shows how this restores the previously
> broken text extraction ability.
> It seems like a good idea to rework TextPosition so that it would be aware of
> its bounding box:
> *) Replace methods PDSimpleFont#getFontWidth(byte[], int, int) and
> PDSimpleFont#getFontHeight(byte[], int, int) with a single method
> PDSimpleFont#getFontBoundingBox(byte[], int, int)
> *) Replace the constructor TextPosition(Matrix, Matrix) with
> TextPosition(Matrix, BoundingBox)
> *) Add new methods TextPosition#getBoundingBox and
> TextPosition#getBoundingBoxDir. This shouldn't affect existing application
> clients, because TextPosition#getY and TextPosition#getHeight remain in place.