[ 
https://issues.apache.org/jira/browse/PDFBOX-3464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15422946#comment-15422946
 ] 

Tilman Hausherr commented on PDFBOX-3464:
-----------------------------------------

You wrote that "font height for this specific font is determined incorrectly" 
but don't mention why, just that the values aren't what you'd like.

A look at the FontBBox of the two fonts shows that the first font has some 
extremely large max and min y values, while the second font (bold) is OK. That 
is the reason for the sizes. See also the blue rectangles on the right, these 
are the max bounds.

The only thing you can do is to calculate the bounds yourself. A look at the 
DrawPrintTextLocations.java examples shows how to do this.

So it isn't a PDFBox bug. It may be a bug in LibreOffice (I suggest you forward 
them the link of this issue), or in the font itself.

For this specific font, one could use (ascent - descent) / 2 instead of the 
height as it is calculated now.

> character height 3 times higher than expected
> ---------------------------------------------
>
>                 Key: PDFBOX-3464
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3464
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>            Reporter: Roman
>            Priority: Minor
>         Attachments: screenshot-1.png, subnode.docx.pdf
>
>
> The issue basically same as PDFBOX-2749, but wrong sample was attached to it 
> by mistake. Correct PDF is attached here.
> The core of the problem is that font height for this specific font is 
> determined incorrectly, please see code with comments below.
> The issue was reproduced on Pdfbox 1.8.4, but as we tested before, same 
> result we get on 1.8.9 and 2.0 versions.
> {code}
> public class Extractor extends PDFTextStripper {
> //<...CUT...>
>       protected void writePage() throws IOException {
>               for (List<TextPosition> textList : charactersByArticle) { 
> //charactersByArticle was inherited from base class
>                       Iterator textIter = textList.iterator();
> //<...CUT...>
>                       while (textIter.hasNext()) {
>                               TextPosition position = (TextPosition) 
> textIter.next();
> //<...CUT...>
>               PDFontDescriptor fontDescriptor = 
> position.getFont().getFontDescriptor();
> //<...CUT...>
>               float yscale = position.getTextPos().getYScale();
>               float asc = Math.abs(fontDescriptor.getAscent() / 1000 * 
> yscale);
>               float rh = 
> Math.abs(fontDescriptor.getFontBoundingBox().getUpperRightY() / 1000 * 
> yscale);
>               float desc = Math.abs(fontDescriptor.getDescent() / 1000 * 
> yscale);
>               float capHeight = Math.abs(fontDescriptor.getCapHeight() / 1000 
> * yscale);
>               if (capHeight == 0)
>                       capHeight = position.getHeight();
>               float h = (rh + Math.max(Math.max(capHeight, 
> position.getHeight()), asc)) / 2;
> //"h" evaluates to 37.39 (should be between 11 and 12)
> //"desc" evaluates to 2.664
> //"capHeight" evaluates to 37.39
> //"position.getHeight()" evaluates to 33.48
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to