[
https://issues.apache.org/jira/browse/PDFBOX-3175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15076742#comment-15076742
]
John Hewson edited comment on PDFBOX-3175 at 1/3/16 1:21 AM:
-------------------------------------------------------------
The whole 1/2 bbox thing is just how PDFStreamEngine used to work in 1.8, and
PDFTextStreamEngine emulates that behaviour for compatibility with subclasses
that need it. It's obviously not an accurate height, but it's also not a
terrible approximation - the font bbox is often around 2x the em height,
so halving it gives values which approximate the em height (for the entire font
though, not a single glyph). The whole PDFTextStripper API has all of these
incorrect assumptions built into it and really needs to be thrown away - but
until that happens we're stuck with what we have.
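
To make the arithmetic behind the half-bbox heuristic concrete, here is a
minimal sketch. It is not the PDFBox source: the class and method names are
hypothetical, the bbox heights are made-up example values, and it assumes the
usual 1000 glyph units per em (Type 3 fonts, which use their own font matrix,
are ignored here).

```java
// Hypothetical illustration of the "half the font bbox" height heuristic.
// Not PDFBox code; names and sample values are assumptions for this sketch.
final class HalfBBoxHeightSketch {

    // Assumed glyph-space scale for non-Type 3 fonts: 1000 units per em.
    static final float GLYPH_UNITS_PER_EM = 1000f;

    // Height estimate obtained by halving the font bounding box height
    // (given in glyph units) and scaling it by the font size.
    static float halfBBoxHeight(float bboxHeightGlyphUnits, float fontSize) {
        return (bboxHeightGlyphUnits / 2f) / GLYPH_UNITS_PER_EM * fontSize;
    }

    // The em height at a given font size is simply the font size.
    static float emHeight(float fontSize) {
        return fontSize;
    }

    public static void main(String[] args) {
        float fontSize = 12f;

        // A font bbox exactly 2x the em height: the estimate equals the em height.
        System.out.println(halfBBoxHeight(2000f, fontSize) + " vs em " + emHeight(fontSize));

        // A tighter font bbox (1.2x the em height): the estimate falls well short.
        System.out.println(halfBBoxHeight(1200f, fontSize) + " vs em " + emHeight(fontSize));
    }
}
```

The estimate is only as good as the assumption that the font bbox is about
twice the em height, and it is the same for every glyph in the font.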
We've historically supported subclassing PDFTextStripper, so we can't remove it
without breaking others' code, and we don't currently offer an alternative.
Personally I don't think PDFTextStripper should be kept around for much longer
- for all its complexity it achieves very little. Whatever replaces
PDFTextStripper will simply subclass PDFStreamEngine directly.
> PDFTextStreamEngine probably miscalculates text height
> ------------------------------------------------------
>
> Key: PDFBOX-3175
> URL: https://issues.apache.org/jira/browse/PDFBOX-3175
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 2.0.0
> Reporter: Leo
> Attachments: MarketT_140815-1-marked-1-18.png,
> MarketT_140815-1-marked-1.png, PDFBOX-3175-reduced.pdf, snapshot.png
>
>
> When parsing a PDF document, TextPosition is created with constant text
> height, about 2 times smaller than character width, regardless of font size.
> The following workaround to calculate dyDisplay fixes the issue:
>     // Default glyph-space scale: 1000 units per em.
>     float verticalScaling = 1/1000f;
>     // Type 3 fonts define their own glyph space via the font matrix.
>     if (font instanceof PDType3Font) {
>         Matrix fontMatrix = font.getFontMatrix();
>         verticalScaling = fontMatrix.getValue(1, 1);
>     }
>     float dyDisplay = bbox.getHeight() * fontSize * verticalScaling;