[
https://issues.apache.org/jira/browse/PDFBOX-3175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15074911#comment-15074911
]
Leo commented on PDFBOX-3175:
-----------------------------
I'm not sure, but I think the problem is caused by the different semantics of
dyDisplay in the PDFTextStreamEngine class and maxHeight in TextPosition.
As per Andreas Lehmkühler's comment in PDFBOX-1001, they should probably be
kept in different fields and their usage should not be mixed up:
https://issues.apache.org/jira/browse/PDFBOX-1001?focusedCommentId=13059335&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13059335
maxHeight is used by the getHeight() method in TextPosition, which the JavaDoc
describes as "This will get the maximum height of all characters in this
string.", but it is populated with the vertical delta rather than the font height.
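As a rough illustration of keeping the two values apart, the sketch below
stores them in separate fields. The class GlyphHeights and its accessors are
hypothetical names made up for this example; they are not part of PDFBox and
do not reflect how TextPosition is actually laid out.

```java
// Hypothetical value holder, NOT a PDFBox class: it only illustrates the idea
// of storing the font height and the vertical delta in separate fields.
final class GlyphHeights {
    // Height derived from the font (assumed: bounding box scaled by font size).
    private final float fontHeight;
    // Vertical displacement applied when the glyph is shown.
    private final float verticalDelta;

    GlyphHeights(float fontHeight, float verticalDelta) {
        this.fontHeight = fontHeight;
        this.verticalDelta = verticalDelta;
    }

    // What a "maximum height of all characters" accessor would be expected to use.
    float getFontHeight() {
        return fontHeight;
    }

    // Kept separate so it cannot be mistaken for a height.
    float getVerticalDelta() {
        return verticalDelta;
    }
}
```

With two fields, a caller that needs the glyph height never reads the
displacement by accident, which is the kind of mix-up suspected here.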
I will certainly try to port the existing tests to my extraction engine, but as
of now it is not trivial, as they have to be ported to my interface. I'm writing
it in Clojure, and some adapter work will be needed to compare the results of
the existing PDFTextStripper-based tests with the results of my
Clojure extension of PDFStreamEngine.
> PDFTextStreamEngine probably miscalculates text height
> ------------------------------------------------------
>
> Key: PDFBOX-3175
> URL: https://issues.apache.org/jira/browse/PDFBOX-3175
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 2.0.0
> Reporter: Leo
>
> When parsing a PDF document, TextPosition is created with constant text
> height, about 2 times smaller than character width, regardless of font size.
> The following workaround to calculate dyDisplay fixes the issue:
>     float verticalScaling = 1/1000f;
>     if (font instanceof PDType3Font) {
>         Matrix fontMatrix = font.getFontMatrix();
>         verticalScaling = fontMatrix.getValue(1, 1);
>     }
>     float dyDisplay = bbox.getHeight() * fontSize * verticalScaling;
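
To make the arithmetic of the workaround easier to follow, here is a minimal
standalone sketch of the same formula. It deliberately avoids PDFBox types so
that it runs on its own: the class TextHeightSketch, the method dyDisplay, and
the sample numbers are assumptions for illustration only, and the nullable
type3VerticalScale parameter stands in for the value that the workaround reads
from the font matrix of a Type 3 font.

```java
// Standalone illustration of the workaround's formula; no PDFBox dependency.
// All names and sample values here are hypothetical.
final class TextHeightSketch {
    // Scale used by the workaround for fonts that are not Type 3.
    static final float DEFAULT_GLYPH_SCALE = 1 / 1000f;

    // bboxHeight: height of the font bounding box, in glyph-space units.
    // fontSize: font size in effect when the text is shown.
    // type3VerticalScale: vertical scale from a Type 3 font matrix, or null
    // for any other font type.
    static float dyDisplay(float bboxHeight, float fontSize, Float type3VerticalScale) {
        float verticalScaling = (type3VerticalScale != null)
                ? type3VerticalScale
                : DEFAULT_GLYPH_SCALE;
        return bboxHeight * fontSize * verticalScaling;
    }

    public static void main(String[] args) {
        // Made-up bounding box height of 900 units at two font sizes.
        System.out.println(dyDisplay(900f, 12f, null));
        System.out.println(dyDisplay(900f, 24f, null));
        // Made-up Type 3 font with a vertical scale of 0.01.
        System.out.println(dyDisplay(10f, 12f, 0.01f));
    }
}
```

The point of the sketch is that the result grows with the font size, unlike
the constant height described in the report.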