[
https://issues.apache.org/jira/browse/PDFBOX-3175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15075896#comment-15075896
]
Tilman Hausherr commented on PDFBOX-3175:
-----------------------------------------
Your argument
{quote}
clustering TextPosition into lines is not a concern of PDFTextStreamEngine, but
a concern of the PDFTextStripper class
{quote}
is of course correct, i.e. all the "voodoo" should be in PDFTextStripper
only and PDFTextStreamEngine should deal only in hard, realistic data. But this
would also mean that TextPosition would return different data than in 1.8,
which would be a problem for many users. (We found our bugs in the text
extraction thanks to people, including you, who noticed that the values from
PrintTextLocations were different.) See also the first comment by John in
PDFBOX-3062.
> PDFTextStreamEngine probably miscalculates text height
> ------------------------------------------------------
>
> Key: PDFBOX-3175
> URL: https://issues.apache.org/jira/browse/PDFBOX-3175
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 2.0.0
> Reporter: Leo
> Attachments: MarketT_140815-1-marked-1-18.png,
> MarketT_140815-1-marked-1.png, PDFBOX-3175-reduced.pdf, snapshot.png
>
>
> When parsing a PDF document, TextPosition is created with a constant text
> height, about two times smaller than the character width, regardless of font
> size. The following workaround to calculate dyDisplay fixes the issue:
>     float verticalScaling = 1/1000f;
>     if (font instanceof PDType3Font) {
>         Matrix fontMatrix = font.getFontMatrix();
>         verticalScaling = fontMatrix.getValue(1, 1);
>     }
>     float dyDisplay = bbox.getHeight() * fontSize * verticalScaling;
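
To make the arithmetic in the quoted workaround concrete, here is a minimal standalone sketch of the same formula. It does not use PDFBox: the class name GlyphHeightSketch, the method displayHeight and the sample numbers are hypothetical, chosen only for illustration. It assumes glyph space is 1/1000 of a text unit for ordinary fonts, and that for a Type 3 font the vertical scale comes from element (1, 1) of its font matrix, as in the workaround.

```java
public class GlyphHeightSketch {
    // Assumed default: glyph space is 1/1000 of a text space unit for non-Type 3 fonts.
    static final float DEFAULT_VERTICAL_SCALING = 1 / 1000f;

    // Same formula as the workaround: bounding box height * font size * vertical scaling.
    static float displayHeight(float bboxHeight, float fontSize, float verticalScaling) {
        return bboxHeight * fontSize * verticalScaling;
    }

    public static void main(String[] args) {
        // Hypothetical font with a bounding box 700 glyph units high, set at 12 pt.
        float regular = displayHeight(700f, 12f, DEFAULT_VERTICAL_SCALING);

        // Hypothetical Type 3 font whose font matrix has 0.01 at element (1, 1),
        // with a bounding box 70 glyph units high, also set at 12 pt.
        float type3 = displayHeight(70f, 12f, 0.01f);

        System.out.println("regular font height: " + regular);
        System.out.println("Type 3 font height:  " + type3);
    }
}
```

With this formula the height scales linearly with the font size, which is the behaviour the report says is missing from the constant height currently stored in TextPosition.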