[jira] [Commented] (PDFBOX-3175) PDFTextStreamEngine probably miscalculates text height

Tilman Hausherr (JIRA) Thu, 31 Dec 2015 03:08:31 -0800

    [ 
https://issues.apache.org/jira/browse/PDFBOX-3175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15075896#comment-15075896
 ]


Tilman Hausherr commented on PDFBOX-3175:
-----------------------------------------

Your argument 
{quote}
clustering TextPosition into lines is not a concern of PDFTextStreamEngine, but 
of concern of PDFTextStripper class
{quote}
is of course correct: all the "voodoo" should live in PDFTextStripper only, 
and PDFTextStreamEngine should deliver only hard, realistic data. But this 
would also mean that TextPosition returns different data than in 1.8, which 
would be a problem for many users. (We found our bugs in the text extraction 
thanks to people, including you, who noticed that the values from 
PrintTextLocations were different.) See also the first comment by John in 
PDFBOX-3062.

> PDFTextStreamEngine probably miscalculates text height
> ------------------------------------------------------
>
>                 Key: PDFBOX-3175
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3175
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.0
>            Reporter: Leo
>         Attachments: MarketT_140815-1-marked-1-18.png, 
> MarketT_140815-1-marked-1.png, PDFBOX-3175-reduced.pdf, snapshot.png
>
>
> When parsing a PDF document, TextPosition is created with a constant text 
> height, about 2 times smaller than the character width, regardless of font size.
> The following workaround to calculate dyDisplay fixes the issue:
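>         // Glyph space is 1/1000 of text space for all fonts except Type 3.
>         float verticalScaling = 1/1000f;
>         if (font instanceof PDType3Font) {
>             // Type 3 fonts define their own font matrix; use its vertical scale.
>             Matrix fontMatrix = font.getFontMatrix();
>             verticalScaling = fontMatrix.getValue(1, 1);
>         }
>         // Glyph bounding-box height scaled to the current font size.
>         float dyDisplay = bbox.getHeight() * fontSize * verticalScaling;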
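
To make the arithmetic in the quoted workaround easier to check, here is a minimal standalone sketch of the same calculation. It deliberately does not use PDFBox: the class TextHeightSketch, the method displayHeight and all sample values are hypothetical, and the nullable type3VerticalScale parameter stands in for fontMatrix.getValue(1, 1) of a Type 3 font. It is an illustration under those assumptions, not the engine's actual code.

```java
public class TextHeightSketch {
    // For fonts other than Type 3, glyph space is 1/1000 of text space.
    static final float DEFAULT_VERTICAL_SCALING = 1 / 1000f;

    /**
     * Hypothetical helper mirroring the workaround from the issue description.
     *
     * @param bboxHeight         glyph bounding-box height in glyph-space units
     * @param fontSize           font size from the text state
     * @param type3VerticalScale vertical entry of a Type 3 font matrix,
     *                           or null for any other font type
     */
    static float displayHeight(float bboxHeight, float fontSize, Float type3VerticalScale) {
        float verticalScaling = (type3VerticalScale != null)
                ? type3VerticalScale
                : DEFAULT_VERTICAL_SCALING;
        return bboxHeight * fontSize * verticalScaling;
    }

    public static void main(String[] args) {
        // Made-up sample values: the same glyph box at two font sizes.
        System.out.println(displayHeight(700f, 12f, null));
        System.out.println(displayHeight(700f, 24f, null));
        // A Type 3 font with a made-up vertical scale of 0.1.
        System.out.println(displayHeight(10f, 12f, 0.1f));
    }
}
```

With this formula the height grows linearly with the font size, which is the behaviour the report says is missing when the height stays constant.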



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

