[ 
https://issues.apache.org/jira/browse/PDFBOX-3175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15074911#comment-15074911
 ] 

Leo commented on PDFBOX-3175:
-----------------------------

I'm not sure, but I think the problem comes from the different semantics of 
dyDisplay in PDFTextStreamEngine and maxHeight in TextPosition. 
As per Andreas Lehmkühler's comment in PDFBOX-1001, they probably should be 
kept in separate fields and their usage should not be mixed up: 
https://issues.apache.org/jira/browse/PDFBOX-1001?focusedCommentId=13059335&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13059335

maxHeight is what the getHeight() method in TextPosition returns. The JavaDoc 
describes that method as "This will get the maximum height of all characters 
in this string.", yet the field is populated with the vertical delta rather 
than with the font height.

I will certainly try to port the existing tests to my extraction engine, but 
for now that is not trivial, as they have to be ported to my interface. I'm 
writing it in Clojure, and some sort of adapter will be needed to compare the 
results of the existing PDFTextStripper-based tests with the results of tests 
run against my Clojure extension of PDFStreamEngine.

> PDFTextStreamEngine probably miscalculates text height
> ------------------------------------------------------
>
>                 Key: PDFBOX-3175
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3175
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.0
>            Reporter: Leo
>
> When parsing a PDF document, TextPosition is created with a constant text 
> height, about 2 times smaller than the character width, regardless of font size.
> The following workaround to calculate dyDisplay fixes the issue:
>         float verticalScaling = 1/1000f;
>         if (font instanceof PDType3Font) {
>             Matrix fontMatrix = font.getFontMatrix();
>             verticalScaling = fontMatrix.getValue(1, 1);
>         }
>         float dyDisplay = bbox.getHeight() * fontSize * verticalScaling;
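
To make the arithmetic in the workaround above easier to check by hand, here 
is a minimal standalone sketch of the same formula. It deliberately does not 
use PDFBox: the class name TextHeightSketch, the method displayHeight and its 
parameters are hypothetical names for illustration only, and the sketch 
assumes that the bounding box height is given in glyph space units and that 
the Type 3 value passed in is the one the snippet reads with 
fontMatrix.getValue(1, 1).

```java
// Hypothetical helper, not part of PDFBox: it only mirrors the arithmetic
// of the workaround quoted in the issue description.
final class TextHeightSketch {

    // Scale used by the workaround for every font that is not Type 3.
    static final float DEFAULT_VERTICAL_SCALING = 1 / 1000f;

    /**
     * @param bboxHeight        font bounding box height, assumed to be in glyph space units
     * @param fontSize          font size in effect for the text being shown
     * @param isType3           true if the font is a Type 3 font
     * @param type3VerticalScale vertical scale taken from the font matrix,
     *                          only consulted when isType3 is true
     * @return the height that the workaround assigns to dyDisplay
     */
    static float displayHeight(float bboxHeight, float fontSize,
                               boolean isType3, float type3VerticalScale) {
        float verticalScaling = isType3 ? type3VerticalScale : DEFAULT_VERTICAL_SCALING;
        return bboxHeight * fontSize * verticalScaling;
    }
}
```

For example, with an assumed bounding box height of 700 units and a 12 pt 
font, displayHeight(700f, 12f, false, 0f) comes out at roughly 8.4, and the 
result grows linearly with the font size, which is the behaviour missing from 
the constant height described in the report.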
