[
https://issues.apache.org/jira/browse/PDFBOX-3175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tilman Hausherr updated PDFBOX-3175:
------------------------------------
Attachment: MarketT_140815-1-marked-1-18.png
The attached image shows the red marks as produced with 1.8. Glyphs are missing
because rendering doesn't work properly for that file in 1.8, but it shows that
the heights are bigger there.
The strategy for guessing the height changed between 1.8 and 2.0; see the
discussion about this at the bottom of PDFBOX-3062.
Btw the text extraction of your file works fine in 2.0 despite the bad height
calculations. The height is only relevant to decide whether texts on different
y coordinates should belong to the same line or not. A height that is too small
would mean that small differences (e.g. superscript / subscript) result in text
lines being split.
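To illustrate why a height that is too small splits lines, here is a minimal
sketch of a vertical overlap test. It is a simplified stand-in, not the actual
PDFBox code: the class name LineOverlapSketch, the helper sameLine, and the
sample coordinates are all made up for this example. It assumes y grows
downward and that the height extends upward from y.

```java
public class LineOverlapSketch {

    // Hypothetical helper (not a PDFBox API): two pieces of text are treated
    // as being on the same line if the y position of one falls inside the
    // vertical extent [y - height, y] of the other.
    static boolean sameLine(float y1, float h1, float y2, float h2) {
        return (y2 <= y1 && y2 >= y1 - h1)
            || (y1 <= y2 && y1 >= y2 - h2);
    }

    public static void main(String[] args) {
        float baselineY = 100f;   // regular text (assumed sample value)
        float superscriptY = 96f; // superscript raised by 4 units (assumed)

        // Plausible heights: the superscript still overlaps the regular text.
        System.out.println(sameLine(baselineY, 8f, superscriptY, 5f));

        // Heights that are too small: no overlap, so the superscript would
        // be treated as a separate line.
        System.out.println(sameLine(baselineY, 2f, superscriptY, 1.5f));
    }
}
```

With the larger heights the two positions overlap and stay on one line; with
the small heights they do not, which is the line-splitting effect described
above.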
> PDFTextStreamEngine probably miscalculates text height
> ------------------------------------------------------
>
> Key: PDFBOX-3175
> URL: https://issues.apache.org/jira/browse/PDFBOX-3175
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 2.0.0
> Reporter: Leo
> Attachments: MarketT_140815-1-marked-1-18.png,
> MarketT_140815-1-marked-1.png, snapshot.png
>
>
> When parsing a PDF document, TextPosition is created with constant text
> height, about 2 times smaller than the character width, regardless of font size.
> The following workaround to calculate dyDisplay fixes the issue:
> float verticalScaling = 1/1000f;
> if (font instanceof PDType3Font) {
>     Matrix fontMatrix = font.getFontMatrix();
>     verticalScaling = fontMatrix.getValue(1, 1);
> }
> float dyDisplay = bbox.getHeight() * fontSize * verticalScaling;