[
https://issues.apache.org/jira/browse/PDFBOX-3175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15076270#comment-15076270
]
Tilman Hausherr commented on PDFBOX-3175:
-----------------------------------------
But both PDFTextStreamEngine and PDFTextStripper have TextLocation objects. If
we don't divide it by 2, then these TextLocation would appear on top of
PDFTextStripper, for the people who extend that one to get the positions. (And
as you, PDFTextStreamEngine won't be made public). The other reason is that
the font bbox / 2 is not always the "height" that is stored; sometimes it is a
different value from the font descriptor (see in the source code, at the
comment "sometimes the bbox has very high values, but CapHeight is OK").
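To make the height selection described above concrete, here is a minimal sketch in plain Java. It is not the PDFBox source: the class and method names are hypothetical, there is no PDFBox dependency, and it assumes the rule is "half the bbox height, replaced by CapHeight when CapHeight is present and smaller". Type 3 fonts, which use their own font matrix, are left out.

```java
// Sketch only: hypothetical names, not the PDFBox implementation.
final class GlyphHeightSketch {

    /**
     * Picks a glyph height in glyph-space units (1/1000 em for most fonts).
     *
     * @param bboxHeight height of the font bounding box
     * @param capHeight  CapHeight from the font descriptor, or 0 if absent
     */
    static float estimateGlyphHeight(float bboxHeight, float capHeight) {
        // Start from half the bounding box: the "/ 2" under discussion.
        float glyphHeight = bboxHeight / 2;
        // Some fonts report a very large bounding box; prefer CapHeight when
        // it is present and smaller, or when the bbox gives nothing usable.
        if (capHeight != 0 && (capHeight < glyphHeight || glyphHeight == 0)) {
            glyphHeight = capHeight;
        }
        return glyphHeight;
    }

    /** Scales to text space, assuming the usual 1/1000 units (not Type 3). */
    static float toTextSpace(float glyphHeight) {
        return glyphHeight / 1000f;
    }

    public static void main(String[] args) {
        // Ordinary font: half of a 1000-unit bbox is used.
        System.out.println(toTextSpace(estimateGlyphHeight(1000f, 0f)));
        // Inflated bbox: the CapHeight of 700 units is used instead.
        System.out.println(toTextSpace(estimateGlyphHeight(20000f, 700f)));
    }
}
```

With these assumptions, simply dropping the "/ 2" would change the first case but not the second, which is why the stored height is not uniformly "bbox / 2".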
You can see what I mean by making the change you think about and then running
the tests, and also compare the output of DrawPrintTextLocations to the
previous output.
I'd like to set this issue to resolved and end this discussion which is rather
something for 2.1. I do get your point that the current structure is weird, and
the API doc is incorrect. But just removing the "/ 2" and making one class public
isn't the magic solution. But you (and Ben McCann) are welcome to help with
code, ideas or criticism to build a better text extraction in the future.
I suspect that the structure should be 3-part, i.e. a "pure"
PDFTextStreamEngine without any hacks, a new class that does the heuristics to
get the height and also offers exact heights if needed, and an improved
PDFTextStripper that fixes the problems that Ben McCann mentioned in several
issues.
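A rough sketch of how such a three-part split might look, in plain Java. Every type and name here is hypothetical and none of them exists in PDFBox. It also assumes that the "exact" height is the full bounding-box height scaled by the font size (as in the reporter's workaround), and that the heuristic is "half the bbox, or CapHeight if smaller".

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: every type here is hypothetical; none exists in PDFBox.
final class ThreePartSketch {

    /** Part 1 output: raw metrics from a "pure" engine, no heuristics. */
    static final class RawGlyph {
        final String text;
        final float bboxHeight; // font bounding-box height, in 1/1000 em
        final float capHeight;  // CapHeight from the descriptor, 0 if absent
        final float fontSize;

        RawGlyph(String text, float bboxHeight, float capHeight, float fontSize) {
            this.text = text;
            this.bboxHeight = bboxHeight;
            this.capHeight = capHeight;
            this.fontSize = fontSize;
        }
    }

    /** Part 2: a pluggable height calculation. */
    interface HeightEstimator {
        float heightOf(RawGlyph glyph);
    }

    /** Exact height (assumed): full bounding box scaled by the font size. */
    static final HeightEstimator EXACT =
            g -> g.bboxHeight * g.fontSize / 1000f;

    /** Heuristic height (assumed): half the bbox, or CapHeight if smaller. */
    static final HeightEstimator HEURISTIC = g -> {
        float h = g.bboxHeight / 2;
        if (g.capHeight != 0 && (g.capHeight < h || h == 0)) {
            h = g.capHeight;
        }
        return h * g.fontSize / 1000f;
    };

    /** Part 3: a stripper-like consumer that only sees the chosen estimator. */
    static List<Float> heights(List<RawGlyph> glyphs, HeightEstimator estimator) {
        List<Float> out = new ArrayList<>();
        for (RawGlyph g : glyphs) {
            out.add(estimator.heightOf(g));
        }
        return out;
    }
}
```

The point of the split is that callers who extend the stripper could choose between the heuristic and exact heights without the engine itself containing any hacks.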
> PDFTextStreamEngine probably miscalculates text height
> ------------------------------------------------------
>
> Key: PDFBOX-3175
> URL: https://issues.apache.org/jira/browse/PDFBOX-3175
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 2.0.0
> Reporter: Leo
> Attachments: MarketT_140815-1-marked-1-18.png,
> MarketT_140815-1-marked-1.png, PDFBOX-3175-reduced.pdf, snapshot.png
>
>
> When parsing a PDF document, TextPosition is created with constant text
> height, about 2 times smaller than character width, regardless of font size.
> The following workaround to calculate dyDisplay fixes the issue:
>     float verticalScaling = 1/1000f;
>     if (font instanceof PDType3Font) {
>         Matrix fontMatrix = font.getFontMatrix();
>         verticalScaling = fontMatrix.getValue(1, 1);
>     }
>     float dyDisplay = bbox.getHeight() * fontSize * verticalScaling;