[ 
https://issues.apache.org/jira/browse/PDFBOX-577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Hewson closed PDFBOX-577.
------------------------------
    Resolution: Invalid

The Ascent and Descent values in the PDF dictionary are **not** used when 
computing glyph positions. In fact, it's common for these values to be missing 
or invalid. In any case, what is actually wanted is the BBox value, but that 
is just as likely to be missing or invalid.

If somebody wants to tackle this problem in the future, it can be done fairly 
easily in 2.0 with the new APIs provided by PDFont, which can extract the BBox 
from the embedded or substituted font - or even compute exact bounds from the 
glyph outlines. A new issue or patch addressing this is welcome.
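
For anyone picking this up, the arithmetic the reporter is asking for can be 
sketched roughly as below. This is only an illustration, not PDFBox code: the 
class GlyphBoxSketch and its boundingBox method are hypothetical, the font 
BBox is assumed to be given in 1/1000 em glyph units (typical for AFM 
metrics), and y is assumed to grow downwards, which matches the reporter's 
"getY minus getHeight" calculation. Text rotation is ignored.

```java
import java.awt.geom.Rectangle2D;

// Hypothetical helper for illustration only; not part of the PDFBox API.
final class GlyphBoxSketch {

    /**
     * Builds a box that covers both ascenders and descenders.
     *
     * @param x               left edge of the text run
     * @param baselineY       baseline position, y growing downwards (assumption)
     * @param width           width of the text run
     * @param bboxLowerLeftY  font BBox lower-left y in 1/1000 em (usually negative)
     * @param bboxUpperRightY font BBox upper-right y in 1/1000 em
     * @param fontSize        font size in the same units as x and baselineY
     */
    static Rectangle2D.Double boundingBox(double x, double baselineY, double width,
                                          double bboxLowerLeftY, double bboxUpperRightY,
                                          double fontSize) {
        // Distance from the baseline up to the top of the font BBox.
        double ascent = bboxUpperRightY / 1000.0 * fontSize;
        // Distance from the baseline down to the bottom of the font BBox.
        double descent = -bboxLowerLeftY / 1000.0 * fontSize;
        // The top edge sits above the baseline; the height spans both parts.
        return new Rectangle2D.Double(x, baselineY - ascent, width, ascent + descent);
    }
}
```

With a BBox of -250..750 at font size 12 and a baseline at y = 100, the box 
runs from y = 91 to y = 103, so the descender region below the baseline is 
included rather than cut off.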

> TextPosition should expose its bounding box
> -------------------------------------------
>
>                 Key: PDFBOX-577
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-577
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: PDModel
>            Reporter: Villu Ruusmann
>         Attachments: 
> 0001-PDFont.java-Add-methods-to-retreive-the-Ascent-and-D.patch, 
> AFM-getHeight.png, AFM-getUpperRightY.png, textposition-randombg.zip
>
>
> It does not seem to be possible to calculate the bounding box of a 
> TextPosition.
> IIUC, TextPosition#getY is the baseline of the text and 
> TextPosition#getHeight is the absolute height of the text. When I subtract 
> the latter from the former I get a top line, but this is only correct if the 
> text does not contain descender characters.
> Below is a screenshot (AFM-getHeight.png) which shows the bounding boxes of 
> TextPositions calculated as {#getX(), #getY() - #getHeight(), #getWidth(), 
> #getHeight()} painted in random colors. For example, the bounding boxes of 
> parentheses are severely misplaced, which makes line-by-line text 
> extraction impossible.
> Right now I've solved the problem by tweaking AFM FontMetrics code so that it 
> returns BoundingBox#getUpperRightY instead of BoundingBox#getHeight when 
> queried via PDSimpleFont#getFontHeight(byte[], int, int). Another screenshot 
> (AFM-getUpperRightY.png) shows how this restores the previously broken text 
> extraction ability.
> It seems like a good idea to rework TextPosition so that it would be aware of 
> its bounding box:
> *) Replace methods PDSimpleFont#getFontWidth(byte[], int, int) and 
> PDSimpleFont#getFontHeight(byte[], int, int) with a single method 
> PDSimpleFont#getFontBoundingBox(byte[], int, int)
> *) Replace the constructor TextPosition(Matrix, Matrix) with 
> TextPosition(Matrix, BoundingBox)
> *) Add new methods TextPosition#getBoundingBox, 
> TextPosition#getBoundingBoxDir. This shouldn't affect existing application 
> clients, because TextPosition#getY and TextPosition#getHeight remain in place.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
