[jira] [Updated] (PDFBOX-3175) PDFTextStreamEngine probably miscalculates text height

Tilman Hausherr (JIRA) Wed, 30 Dec 2015 05:13:24 -0800

     [ 
https://issues.apache.org/jira/browse/PDFBOX-3175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Tilman Hausherr updated PDFBOX-3175:
------------------------------------
    Attachment: MarketT_140815-1-marked-1-18.png

The attachment shows the red marks as produced with 1.8. Glyphs are missing 
because rendering doesn't work properly for that file in 1.8, but the image 
shows that the heights are bigger there. The strategy for guessing the height 
changed between 1.8 and 2.0; see the discussion at the bottom of PDFBOX-3062.

Btw, the text extraction of your file works fine in 2.0 despite the bad height 
calculations. The height is only relevant for deciding whether texts at 
different y coordinates belong to the same line or not. A height that is too 
small would mean that small differences (e.g. superscript / subscript) result 
in text lines being split.
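
To make the effect of an underestimated height concrete, here is a minimal 
sketch of such a same-line decision. It is not the actual PDFBox code: the 
class name, the method, and the 0.5 tolerance factor are all assumptions 
chosen for illustration.

```java
/** Hypothetical sketch only; not PDFBox's actual line-grouping logic. */
final class LineGroupingSketch {
    private LineGroupingSketch() {}

    /**
     * Treats two text runs as belonging to the same line when their
     * y coordinates differ by no more than a fraction (tolerance) of
     * the smaller of the two text heights.
     */
    public static boolean sameLine(float y1, float height1,
                                   float y2, float height2,
                                   float tolerance) {
        float minHeight = Math.min(height1, height2);
        return Math.abs(y1 - y2) <= minHeight * tolerance;
    }

    public static void main(String[] args) {
        // Base text at y = 100, superscript raised by 3 units (made-up values).
        // With a plausible height of 10, both stay on one line.
        System.out.println(sameLine(100f, 10f, 103f, 10f, 0.5f)); // true
        // With an underestimated height of 4, the superscript is split off.
        System.out.println(sameLine(100f, 4f, 103f, 4f, 0.5f));   // false
    }
}
```

With the same 3-unit offset, only the reported height decides whether the 
superscript stays on the line.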

> PDFTextStreamEngine probably miscalculates text height
> ------------------------------------------------------
>
>                 Key: PDFBOX-3175
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3175
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.0
>            Reporter: Leo
>         Attachments: MarketT_140815-1-marked-1-18.png, 
> MarketT_140815-1-marked-1.png, snapshot.png
>
>
> When parsing a PDF document, TextPosition is created with a constant text 
> height, about 2 times smaller than the character width, regardless of font 
> size.
> The following workaround to calculate dyDisplay fixes the issue:
>         float verticalScaling = 1/1000f;
>         if (font instanceof PDType3Font) {
>             Matrix fontMatrix = font.getFontMatrix();
>             verticalScaling = fontMatrix.getValue(1, 1);
>         }
>         float dyDisplay = bbox.getHeight() * fontSize * verticalScaling;
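
The arithmetic of the quoted workaround can be checked with plain numbers. 
The sketch below does not use PDFBox at all; the class name, the bounding-box 
heights, the font size, and the Type 3 font-matrix entry are made-up values 
for illustration, not data from the attached file.

```java
/** Hypothetical sketch of the workaround's arithmetic, without PDFBox. */
final class DyDisplaySketch {
    private DyDisplaySketch() {}

    /** Default glyph-space scaling: 1/1000 of a text-space unit. */
    public static final float DEFAULT_VERTICAL_SCALING = 1 / 1000f;

    /** Same formula as the quoted workaround: bbox height * font size * scaling. */
    public static float dyDisplay(float bboxHeight, float fontSize,
                                  float verticalScaling) {
        return bboxHeight * fontSize * verticalScaling;
    }

    public static void main(String[] args) {
        // Assumed non-Type 3 font: bbox height 700 glyph units at 12 pt.
        float regular = dyDisplay(700f, 12f, DEFAULT_VERTICAL_SCALING);
        // Assumed Type 3 font whose font matrix has 0.01 at (1, 1),
        // with a bbox height of 70 in its own glyph space.
        float type3 = dyDisplay(70f, 12f, 0.01f);
        System.out.println(regular + " " + type3); // both roughly 8.4
    }
}
```

In both cases the resulting height grows with the font size, unlike the 
constant height described in the report.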



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
