[ https://issues.apache.org/jira/browse/PDFBOX-577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12883224#action_12883224 ]
Villu Ruusmann commented on PDFBOX-577:
---------------------------------------
I did not implement the proposed solution, because it turned out to be too
much work for my modest PDF text extraction goals.

A complete patch would have to implement
PDSimpleFont#getFontBoundingBox(byte[], int, int) for every subclass of
PDSimpleFont, and there are obstacles in the way. Some font types lack proper
FontBox support, whereas others do not seem to support the concept of a
"bounding box" at the required level of detail (e.g. they give the dimensions
of the box, but not the location of the baseline within it).
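
To illustrate why the dimensions alone are not enough, here is a minimal
sketch in plain Java. It does not use PDFBox or FontBox: GlyphExtents,
naiveBox and baselineAwareBox are hypothetical names made up for this example.
It also assumes page coordinates in which y grows downward, as the "getY minus
getHeight gives the top line" calculation in the issue implies.

```java
import java.util.Arrays;

public class BaselineBoxSketch {

    /** Hypothetical glyph extents relative to the baseline (up is positive). */
    static final class GlyphExtents {
        final double lowerLeftY;  // negative when the glyph descends below the baseline
        final double upperRightY; // positive when the glyph rises above the baseline

        GlyphExtents(double lowerLeftY, double upperRightY) {
            this.lowerLeftY = lowerLeftY;
            this.upperRightY = upperRightY;
        }

        /** Total height of the box; carries no hint of where the baseline sits. */
        double height() {
            return upperRightY - lowerLeftY;
        }
    }

    /** Returns {top, bottom}, assuming (wrongly) that the box sits on the baseline. */
    static double[] naiveBox(double baselineY, GlyphExtents g) {
        return new double[] { baselineY - g.height(), baselineY };
    }

    /** Returns {top, bottom} using the extents above and below the baseline. */
    static double[] baselineAwareBox(double baselineY, GlyphExtents g) {
        return new double[] { baselineY - g.upperRightY, baselineY - g.lowerLeftY };
    }

    public static void main(String[] args) {
        // Assumed example values: a parenthesis-like glyph, 8 units up, 2 units down.
        GlyphExtents paren = new GlyphExtents(-2.0, 8.0);
        System.out.println(Arrays.toString(naiveBox(100.0, paren)));         // [90.0, 100.0]
        System.out.println(Arrays.toString(baselineAwareBox(100.0, paren))); // [92.0, 102.0]
    }
}
```

With a baseline at y = 100, the naive box spans 90 to 100: it starts too high
by the amount of the descent and cuts off the descender. The baseline-aware
box spans 92 to 102, which is only computable when the font reports where the
baseline lies within the box.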
> TextPosition should expose its bounding box
> -------------------------------------------
>
> Key: PDFBOX-577
> URL: https://issues.apache.org/jira/browse/PDFBOX-577
> Project: PDFBox
> Issue Type: Improvement
> Reporter: Villu Ruusmann
> Attachments: AFM-getHeight.png, AFM-getUpperRightY.png
>
>
> It does not seem to be possible to calculate the bounding box of a
> TextPosition.
> IIUC, TextPosition#getY is the baseline of the text and
> TextPosition#getHeight is the absolute height of the text. When I subtract
> the latter from the former I get a top line, but this is only correct if the
> text does not contain descender characters.
> Below is a screenshot (AFM-getHeight.png) which shows the bounding boxes of
> TextPositions calculated as {#getX(), #getY() - #getHeight(), #getWidth(),
> #getHeight()} and painted in random colors. For example, the bounding boxes
> of parentheses are severely misplaced, which makes line-by-line text
> extraction impossible.
> Right now I've solved the problem by tweaking the AFM FontMetrics code so
> that it returns BoundingBox#getUpperRightY instead of BoundingBox#getHeight
> when queried via PDSimpleFont#getFontHeight(byte[], int, int). Another
> screenshot (AFM-getUpperRightY.png) shows how this restores the previously
> broken text extraction.
> It seems like a good idea to rework TextPosition so that it would be aware of
> its bounding box:
> *) Replace methods PDSimpleFont#getFontWidth(byte[], int, int) and
> PDSimpleFont#getFontHeight(byte[], int, int) with a single method
> PDSimpleFont#getFontBoundingBox(byte[], int, int)
> *) Replace the constructor TextPosition(Matrix, Matrix) with
> TextPosition(Matrix, BoundingBox)
> *) Add new methods TextPosition#getBoundingBox and
> TextPosition#getBoundingBoxDir. This shouldn't affect existing client
> applications, because TextPosition#getY and TextPosition#getHeight remain in
> place.