[
https://issues.apache.org/jira/browse/PDFBOX-3175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15075915#comment-15075915
]
Leo edited comment on PDFBOX-3175 at 12/31/15 12:05 PM:
--------------------------------------------------------
PrintTextLocations uses PDFTextStripper, judging by the line "public class
PrintTextLocations extends PDFTextStripper". The text locations produced by
PDFTextStripper will continue to carry the halved height, as they do now, if
we move that division into the PDFTextStripper class. At the same time, the
PDFTextStreamEngine class will be free of the 1/2 miscalculation. We just move
the 1/2 line from the constructor call in PDFTextStreamEngine to
PDFTextStripper. Under this scenario, current users of the PDFTextStripper
class will see no changes. I'm not sure how easily that line could be
integrated, since PDFTextStripper is very messy. But I think it's the only way
to keep compatibility for users of the old erroneous value and still return
correct data according to the original API design.
As for the docstring, I don't think retrofitting the API design to an obviously
erroneous behavior is a good idea. getHeight() should return what it says it
returns: the height.
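To illustrate the proposed move, here is a minimal sketch. The classes EngineSketch and StripperSketch are hypothetical stand-ins, not the real PDFTextStreamEngine and PDFTextStripper, and the method name heightFor is an assumption made only for this example. It shows the base class reporting the full height while the subclass keeps the legacy halved value.

```java
public class HalvingMoveSketch {
    /** Hypothetical stand-in for the engine: reports the full computed height. */
    public static class EngineSketch {
        public float heightFor(float fullHeight) {
            return fullHeight; // no 1/2 factor at this level any more
        }
    }

    /** Hypothetical stand-in for the stripper: re-applies the legacy halving. */
    public static class StripperSketch extends EngineSketch {
        @Override
        public float heightFor(float fullHeight) {
            // The division lives here so existing stripper users see the old value.
            return super.heightFor(fullHeight) / 2;
        }
    }

    public static void main(String[] args) {
        float fullHeight = 10.8f; // illustrative value, not taken from a real PDF
        System.out.println("engine:   " + new EngineSketch().heightFor(fullHeight));
        System.out.println("stripper: " + new StripperSketch().heightFor(fullHeight));
    }
}
```

Whether the real classes allow such a clean override is exactly the open question mentioned above.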
By the way, I can't find the equivalent of the division-by-2 line in the 1.8
branch. Was it introduced at that point in PDFTextStreamEngine as a hotfix to
match the TextPosition values of 1.8? If so, that location in
PDFTextStreamEngine is probably not where the 1/2 coefficient was introduced
in 1.8.
> PDFTextStreamEngine probably miscalculates text height
> ------------------------------------------------------
>
> Key: PDFBOX-3175
> URL: https://issues.apache.org/jira/browse/PDFBOX-3175
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 2.0.0
> Reporter: Leo
> Attachments: MarketT_140815-1-marked-1-18.png,
> MarketT_140815-1-marked-1.png, PDFBOX-3175-reduced.pdf, snapshot.png
>
>
> When parsing a PDF document, TextPosition is created with a constant text
> height, about 2 times smaller than the character width, regardless of font
> size. The following workaround to calculate dyDisplay fixes the issue:
> float verticalScaling = 1/1000f;
> if (font instanceof PDType3Font) {
>     Matrix fontMatrix = font.getFontMatrix();
>     verticalScaling = fontMatrix.getValue(1, 1);
> }
> float dyDisplay = bbox.getHeight() * fontSize * verticalScaling;
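
The arithmetic of the workaround quoted above can be tried without PDFBox. This is a rough sketch under stated assumptions: the class TextHeightSketch, the method displayHeight and its parameters are hypothetical, and the font is reduced to a flag plus the (1, 1) entry of its font matrix. It only shows that the resulting height scales with font size instead of staying constant.

```java
public class TextHeightSketch {
    /** Glyph-space scaling assumed for non-Type3 fonts, as in the workaround. */
    static final float DEFAULT_VERTICAL_SCALING = 1 / 1000f;

    /**
     * Mirrors the workaround: bounding-box height times font size times vertical scaling.
     * fontMatrixScaleY stands in for fontMatrix.getValue(1, 1) and is used only for Type3 fonts.
     */
    public static float displayHeight(float bboxHeight, float fontSize,
                                      boolean isType3, float fontMatrixScaleY) {
        float verticalScaling = isType3 ? fontMatrixScaleY : DEFAULT_VERTICAL_SCALING;
        return bboxHeight * fontSize * verticalScaling;
    }

    public static void main(String[] args) {
        // Illustrative numbers only: a 900-unit bounding box at two font sizes.
        System.out.println(displayHeight(900f, 12f, false, 0f));
        System.out.println(displayHeight(900f, 24f, false, 0f));
        // A Type3 font with an assumed font matrix scale of 0.01.
        System.out.println(displayHeight(100f, 10f, true, 0.01f));
    }
}
```

Doubling the font size doubles the reported height, which is the behavior the report says is missing.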