[
https://issues.apache.org/jira/browse/PDFBOX-3175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Leo updated PDFBOX-3175:
------------------------
Attachment: snapshot.png
Example of text extraction from my engine with my suggested workaround. This
comes from PDFBOX-3038-001033-p2.pdf. The red boxes show the clusters of
TextPosition, their height is the maximum height of all TextPositions in
cluster. The original test with my workaround applied fails, returning "I" and
"Introduction" as a single line, while expecting it to be only "I". My
extraction engine handles the case according to the expected data in the
original test, as can be seen from screenshot.
> PDFTextStreamEngine probably miscalculates text height
> ------------------------------------------------------
>
> Key: PDFBOX-3175
> URL: https://issues.apache.org/jira/browse/PDFBOX-3175
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 2.0.0
> Reporter: Leo
> Attachments: snapshot.png
>
>
> When parsing a PDF document, TextPosition is created with constant text
> height, about 2 time smaller than character width, regardless of font size.
> The following workaround to calculate dyDisplay fixes the issue:
> float verticalScaling = 1/1000f;
> if (font instanceof PDType3Font) {
> Matrix fontMatrix = font.getFontMatrix();
> verticalScaling = fontMatrix.getValue(1, 1);
> }
> float dyDisplay = bbox.getHeight() * fontSize * verticalScaling;
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]