[
https://issues.apache.org/jira/browse/PDFBOX-3175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15075147#comment-15075147
]
Tilman Hausherr commented on PDFBOX-3175:
-----------------------------------------
I can't tell you what heuristics you should use... the existing ones (both in
PDFTextStreamEngine for metrics, and in PDFTextStripper for deciding what is a
word / line) seem to work. That's the "official" way. We're open source, so you
can either change the source code itself, or write your own version of
PDFTextStreamEngine. IIRC the reason PDFTextStreamEngine isn't extendable is
that the weird calculations we do are only for PDFTextStripper.
To get really accurate glyph sizes, you'd have to get the path from a glyph and
then get its bounding box. There is no example for this yet (but it can be
done); it is something we're thinking about for 2.1.
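As a rough illustration of the bounding-box idea, here is a minimal sketch using only java.awt.geom. It assumes you have already obtained the glyph outline as a GeneralPath in glyph space; the class name, the {{glyphHeight}} helper, the {{unitsPerEm}} parameter and the rectangular outline are all made up for the example and are not PDFBox API.

```java
import java.awt.geom.GeneralPath;
import java.awt.geom.Rectangle2D;

// Hypothetical sketch: measure a glyph's height from its outline instead of
// from font-wide metrics. Nothing here is PDFBox API.
class GlyphBoundsSketch {

    // Height of the outline's bounding box, scaled from glyph space
    // (unitsPerEm units per em, assumed 1000 for most fonts) to the font size.
    static float glyphHeight(GeneralPath outline, float unitsPerEm, float fontSize) {
        Rectangle2D box = outline.getBounds2D();
        return (float) (box.getHeight() / unitsPerEm * fontSize);
    }

    public static void main(String[] args) {
        // Stand-in outline: a rectangle 500 units wide and 700 units tall.
        GeneralPath outline = new GeneralPath();
        outline.moveTo(50, 0);
        outline.lineTo(550, 0);
        outline.lineTo(550, 700);
        outline.lineTo(50, 700);
        outline.closePath();

        System.out.println(glyphHeight(outline, 1000f, 12f));
    }
}
```

In a real extractor the outline would come from the font itself rather than being constructed by hand.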
In the meantime I may have found the problem, and the solution is similar to
yours but at another place. It is related to PDFBOX-2508 and analogous to [
https://svn.apache.org/r1711701 ]. I.e. the suspicious line is
{code}
float height = font.getFontMatrix().transformPoint(0, glyphHeight).y;
{code}
Only type 3 fonts should be scaled this way. For all other fonts the factor is
1/1000. (I was wondering why I didn't get bigger red marks even when I
hardcoded {{glyphHeight = capHeight}}, but got correct blue marks in the
DrawPrint tool).
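To make the rule concrete, here is a small self-contained sketch. It is not the actual PDFBox change: the class, the {{scaledHeight}} helper and the {{isType3}} flag are hypothetical, and the font matrix is modelled as a plain six-element array [a b c d e f] rather than a PDFBox Matrix.

```java
// Hypothetical sketch of the scaling rule: the font matrix is applied only
// for type 3 fonts, every other font uses a fixed factor of 1/1000.
class HeightScalingSketch {

    // fontMatrix holds the six PDF matrix entries [a b c d e f].
    static float scaledHeight(float glyphHeight, boolean isType3, float[] fontMatrix) {
        if (isType3) {
            // y component of transforming the point (0, glyphHeight):
            // y' = 0 * b + glyphHeight * d + f
            return glyphHeight * fontMatrix[3] + fontMatrix[5];
        }
        // All other font types: glyph space is 1/1000 of text space.
        return glyphHeight / 1000f;
    }

    public static void main(String[] args) {
        float[] type3Matrix = {0.01f, 0f, 0f, 0.01f, 0f, 0f};
        System.out.println(scaledHeight(70f, true, type3Matrix));
        System.out.println(scaledHeight(700f, false, type3Matrix));
    }
}
```

The non-type-3 branch ignores the matrix argument on purpose, so an unusual matrix value cannot leak into the height of an ordinary font.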
> PDFTextStreamEngine probably miscalculates text height
> ------------------------------------------------------
>
> Key: PDFBOX-3175
> URL: https://issues.apache.org/jira/browse/PDFBOX-3175
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 2.0.0
> Reporter: Leo
> Attachments: MarketT_140815-1-marked-1-18.png,
> MarketT_140815-1-marked-1.png, snapshot.png
>
>
> When parsing a PDF document, TextPosition is created with a constant text
> height, about 2 times smaller than the character width, regardless of font
> size.
> The following workaround to calculate dyDisplay fixes the issue:
> float verticalScaling = 1/1000f;
> if (font instanceof PDType3Font) {
>     Matrix fontMatrix = font.getFontMatrix();
>     verticalScaling = fontMatrix.getValue(1, 1);
> }
> float dyDisplay = bbox.getHeight() * fontSize * verticalScaling;
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)