[jira] [Comment Edited] (PDFBOX-3175) PDFTextStreamEngine probably miscalculates text height

John Hewson (JIRA) Sat, 02 Jan 2016 17:21:52 -0800

    [ 
https://issues.apache.org/jira/browse/PDFBOX-3175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15076742#comment-15076742
 ]


John Hewson edited comment on PDFBOX-3175 at 1/3/16 1:21 AM:
-------------------------------------------------------------

The whole 1/2 bbox thing is just how PDFStreamEngine used to work in 1.8, and 
PDFTextStreamEngine emulates that behaviour for compatibility with subclasses 
that need it. It's obviously not an accurate height, but it's also not a 
terrible approximation of height - the bbox is often around 2x the em height, 
so halving it gives values which approximate the em height (for the entire font 
though, not a single glyph). The whole PDFTextStripper API has all of these 
incorrect assumptions built into it and really needs to be thrown away - but 
until that happens we're stuck with what we have.
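
To make the arithmetic concrete, here is a minimal sketch of the half-bbox estimate described above. It is not PDFBox code: the class and method names are hypothetical, the bbox height of 2000 units is an assumed value chosen to match the "around 2x the em height" case, and it assumes the usual 1000 glyph-space units per em (Type 3 fonts supply their own font matrix, as the workaround in the issue description notes).

```java
// Sketch of the half-bbox height heuristic. Not PDFBox code; all names are hypothetical.
final class HalfBBoxHeightSketch {

    // Assumption: glyph space uses 1000 units per em (not true for Type 3 fonts).
    static final float UNITS_PER_EM = 1000f;

    /** Half the font bounding box height, converted from glyph units to em units. */
    static float halfBBoxHeightInEm(float bboxHeightGlyphUnits) {
        return (bboxHeightGlyphUnits / 2f) / UNITS_PER_EM;
    }

    /** The same estimate scaled by the font size. */
    static float estimatedTextHeight(float bboxHeightGlyphUnits, float fontSize) {
        return halfBBoxHeightInEm(bboxHeightGlyphUnits) * fontSize;
    }

    public static void main(String[] args) {
        // Assumed value: a font whose bbox is about twice the em height.
        float bboxHeight = 2000f;
        System.out.println(halfBBoxHeightInEm(bboxHeight));
        System.out.println(estimatedTextHeight(bboxHeight, 12f));
    }
}
```

The estimate depends only on the font-wide bbox, so every glyph in the font gets the same value, which is why it approximates the em height rather than the height of any single glyph.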

We've historically supported subclassing PDFTextStripper, so we can't remove it 
without breaking others' code, and we don't currently offer an alternative. 
Personally I don't think PDFTextStripper should be kept around for much longer 
- for all its complexity it achieves very little. Whatever replaces 
PDFTextStripper will simply subclass PDFStreamEngine directly.


was (Author: jahewson):
The whole 1/2 bbox thing is just how PDFStreamEngine used to work in 1.8, and 
PDFTextStreamEngine emulates that behaviour for compatibility with subclasses 
that need it. It's obviously not an accurate height, but it's also not a 
terrible approximation of height - the bbox is often around 2x the em height, 
so halving it gives values which approximate the em height (for the entire font 
though, not a single glyph). The whole PDFTextStripper has all of these 
incorrect assumptions built into it and really needs to be thrown away - but 
until that happens we're stuck with what we have.

We've historically supported subclassing PDFTextStripper, so we can't remove it 
without breaking others' code, and we don't currently offer an alternative. 
Personally I don't think PDFTextStripper should be kept around for much longer 
- for all its complexity it achieves very little. Whatever replaces 
PDFTextStripper will simply subclass PDFStreamEngine directly.

> PDFTextStreamEngine probably miscalculates text height
> ------------------------------------------------------
>
>                 Key: PDFBOX-3175
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3175
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.0
>            Reporter: Leo
>         Attachments: MarketT_140815-1-marked-1-18.png, 
> MarketT_140815-1-marked-1.png, PDFBOX-3175-reduced.pdf, snapshot.png
>
>
> When parsing a PDF document, TextPosition is created with a constant text 
> height, about 2 times smaller than the character width, regardless of font size.
> The following workaround to calculate dyDisplay fixes the issue:
>         float verticalScaling = 1/1000f;
>         if (font instanceof PDType3Font) {
>             Matrix fontMatrix = font.getFontMatrix();
>             verticalScaling = fontMatrix.getValue(1, 1);
>         }
>         float dyDisplay = bbox.getHeight() * fontSize * verticalScaling;



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

