[jira] [Commented] (PDFBOX-3175) PDFTextStreamEngine probably miscalculates text height

Leo (JIRA) Thu, 31 Dec 2015 02:33:07 -0800

    [ 
https://issues.apache.org/jira/browse/PDFBOX-3175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15075885#comment-15075885
 ]


Leo commented on PDFBOX-3175:
-----------------------------

The values are probably correct for PDFTextStripper class usage. I thought that 
clustering TextPosition into lines is not a concern of PDFTextStreamEngine, but 
a concern of the PDFTextStripper class. So if a particular extension of the 
PDFTextStreamEngine class needs the coordinates to be divided by 2, it does not 
mean that every other possible extension will also need them to be divided. I 
think that particular line of code belongs to the part where line calculations 
are done, not where the character stream is produced. Maybe it should just be 
moved to PDFTextStripper?

All of that assumes, of course, that I correctly understand the point of 
designing the two as separate classes.
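
A minimal sketch of that separation, in plain Java with hypothetical stand-in 
types (Glyph, produceGlyphs and countLines are illustrative names, not PDFBox 
API). It assumes the halved height is only needed as a line-grouping tolerance; 
the producing layer reports the full height and the consuming layer applies the 
halving itself.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for illustration; these are NOT the PDFBox classes.
final class LayeringSketch {

    /** Raw glyph data as a stream-producing layer might report it. */
    static final class Glyph {
        final float y;
        final float height; // full, unadjusted height

        Glyph(float y, float height) {
            this.y = y;
            this.height = height;
        }
    }

    /** Producer layer: emits glyphs with unmodified heights. */
    static List<Glyph> produceGlyphs(float[] ys, float fullHeight) {
        List<Glyph> out = new ArrayList<>();
        for (float y : ys) {
            out.add(new Glyph(y, fullHeight));
        }
        return out;
    }

    /** Consumer layer: the halving heuristic lives here, next to line grouping. */
    static int countLines(List<Glyph> glyphs) {
        int lines = 0;
        float lastY = Float.NaN;
        for (Glyph g : glyphs) {
            float tolerance = g.height / 2; // assumed line-grouping tolerance
            if (Float.isNaN(lastY) || Math.abs(g.y - lastY) > tolerance) {
                lines++; // vertical jump larger than the tolerance: new line
            }
            lastY = g.y;
        }
        return lines;
    }

    public static void main(String[] args) {
        // Made-up y positions: two glyphs near y=100, two near y=80.
        List<Glyph> glyphs = produceGlyphs(new float[] {100f, 101f, 80f, 79.5f}, 10f);
        System.out.println("lines: " + countLines(glyphs));
    }
}
```

With this layout, another consumer of the same glyph stream still sees the full 
height and can choose its own adjustment.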

> PDFTextStreamEngine probably miscalculates text height
> ------------------------------------------------------
>
>                 Key: PDFBOX-3175
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3175
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.0
>            Reporter: Leo
>         Attachments: MarketT_140815-1-marked-1-18.png, 
> MarketT_140815-1-marked-1.png, PDFBOX-3175-reduced.pdf, snapshot.png
>
>
> When parsing a PDF document, TextPosition is created with a constant text 
> height, about 2 times smaller than the character width, regardless of font size.
> The following workaround to calculate dyDisplay fixes the issue:
>         float verticalScaling = 1/1000f;
>         if (font instanceof PDType3Font) {
>             Matrix fontMatrix = font.getFontMatrix();
>             verticalScaling = fontMatrix.getValue(1, 1);
>         }
>         float dyDisplay = bbox.getHeight() * fontSize * verticalScaling;
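
The arithmetic of the quoted workaround can be tried in isolation. The sketch 
below is plain Java with no PDFBox types; the class name, the method name and 
the sample numbers are hypothetical. The caller is assumed to pass 1/1000 for 
ordinary fonts and the (1, 1) entry of the font matrix for Type 3 fonts, as in 
the snippet above.

```java
// Plain-Java sketch of the workaround arithmetic; no PDFBox types are used.
final class TextHeightSketch {

    // Default scale used by the workaround: 1000 glyph units per text-space unit.
    static final float DEFAULT_VERTICAL_SCALING = 1 / 1000f;

    /**
     * @param bboxHeight      font bounding box height in glyph units
     * @param fontSize        font size
     * @param verticalScaling 1/1000 for ordinary fonts; for Type 3 fonts,
     *                        the (1, 1) entry of the font matrix
     * @return bboxHeight * fontSize * verticalScaling
     */
    static float displayHeight(float bboxHeight, float fontSize, float verticalScaling) {
        return bboxHeight * fontSize * verticalScaling;
    }

    public static void main(String[] args) {
        // Hypothetical input: a bounding box 1000 glyph units high at two font sizes.
        float at12 = displayHeight(1000f, 12f, DEFAULT_VERTICAL_SCALING);
        float at24 = displayHeight(1000f, 24f, DEFAULT_VERTICAL_SCALING);
        System.out.println("size 12: " + at12);
        System.out.println("size 24: " + at24);
    }
}
```

The result grows linearly with the font size, which is the behaviour the 
report says is missing from the constant height.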



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

