[jira] [Comment Edited] (PDFBOX-3175) PDFTextStreamEngine probably miscalculates text height

Leo (JIRA) Thu, 31 Dec 2015 04:06:08 -0800

    [ 
https://issues.apache.org/jira/browse/PDFBOX-3175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15075915#comment-15075915
 ]


Leo edited comment on PDFBOX-3175 at 12/31/15 12:05 PM:
--------------------------------------------------------

PrintTextLocations uses PDFTextStripper, judging by the line "public class 
PrintTextLocations extends PDFTextStripper". The text heights produced by 
PDFTextStripper will continue to be halved, as they are now, if we move that 
division into the PDFTextStripper class. At the same time, the 
PDFTextStreamEngine class will be free of the 1/2 miscalculation. We just move 
that 1/2 line from the constructor call in PDFTextStreamEngine to 
PDFTextStripper. Under this scenario, current users of the PDFTextStripper 
class will see no changes. I'm not sure how easily that line could be 
integrated, since PDFTextStripper is very messy. But I think it's the only way 
to keep compatibility for users of the old erroneous value and still return 
correct data according to the original API design.
As for the docstring, I don't think retrofitting the API design to obviously 
erroneous behavior is a good idea. getHeight() should return what it says 
it returns: the height. 
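
To make the proposal concrete, here is a minimal sketch of the idea. The class 
and method names below are hypothetical stand-ins that I made up for 
illustration; they are not the actual PDFBox classes or signatures, and the 
bounding-box height of 1000 units is an assumed value.

```java
// Hypothetical stand-ins for illustration only; not the real PDFBox classes.
public class HeightCompatSketch {

    /** Plays the role of PDFTextStreamEngine: reports the full text height. */
    public static class StreamEngine {
        /** bboxHeight is assumed to be in 1/1000 glyph-space units. */
        public float computeHeight(float bboxHeight, float fontSize) {
            return bboxHeight * fontSize / 1000f;
        }
    }

    /** Plays the role of PDFTextStripper: keeps the legacy halved value. */
    public static class Stripper extends StreamEngine {
        @Override
        public float computeHeight(float bboxHeight, float fontSize) {
            // The division by 2 now lives in the subclass, so existing
            // stripper users keep seeing the same numbers as before.
            return super.computeHeight(bboxHeight, fontSize) / 2f;
        }
    }

    public static void main(String[] args) {
        float bboxHeight = 1000f; // assumed font bounding-box height
        float fontSize = 12f;
        System.out.println(new StreamEngine().computeHeight(bboxHeight, fontSize));
        System.out.println(new Stripper().computeHeight(bboxHeight, fontSize));
    }
}
```

With this split, code that talks to the base engine gets the real height, while 
anything built on the stripper subclass is unaffected.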

By the way, I can't find the equivalent of the division-by-2 line in the 1.8 
branch. Was it added at that spot in PDFTextStreamEngine as a hotfix to match 
the TextPosition values of 1.8? If so, that spot in PDFTextStreamEngine is 
probably not where the 1/2 coefficient was introduced in 1.8 in the first 
place.


> PDFTextStreamEngine probably miscalculates text height
> ------------------------------------------------------
>
>                 Key: PDFBOX-3175
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3175
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.0
>            Reporter: Leo
>         Attachments: MarketT_140815-1-marked-1-18.png, 
> MarketT_140815-1-marked-1.png, PDFBOX-3175-reduced.pdf, snapshot.png
>
>
> When parsing a PDF document, TextPosition is created with a constant text 
> height, about 2 times smaller than the character width, regardless of font 
> size.
> The following workaround to calculate dyDisplay fixes the issue:
>         float verticalScaling = 1/1000f;
>         if (font instanceof PDType3Font) {
>             Matrix fontMatrix = font.getFontMatrix();
>             verticalScaling = fontMatrix.getValue(1, 1);
>         }
>         float dyDisplay = bbox.getHeight() * fontSize * verticalScaling;
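
For a feel of the numbers, the formula in the quoted workaround can be tried 
out in isolation. This is only a sketch: the class name, the helper method, the 
bounding-box height of 900 units and the Type 3 scaling of 0.01 are all 
assumptions of mine, and nothing here calls PDFBox.

```java
// Standalone arithmetic sketch; all names and sample values are assumptions.
public class DyDisplaySketch {

    /**
     * Same formula as the quoted workaround:
     * bounding-box height * font size * vertical scaling.
     */
    public static float dyDisplay(float bboxHeight, float fontSize,
                                  float verticalScaling) {
        return bboxHeight * fontSize * verticalScaling;
    }

    public static void main(String[] args) {
        // Default case: glyph space is 1/1000 of text space.
        float defaultScaling = 1 / 1000f;
        System.out.println(dyDisplay(900f, 10f, defaultScaling));
        System.out.println(dyDisplay(900f, 20f, defaultScaling));

        // Type 3 case: the scaling would come from the font matrix instead;
        // 0.01 is just an assumed value for fontMatrix.getValue(1, 1).
        float type3Scaling = 0.01f;
        System.out.println(dyDisplay(90f, 10f, type3Scaling));
    }
}
```

The point is that the result scales with the font size, unlike the constant 
height described in the report.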



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
