[jira] [Commented] (PDFBOX-3175) PDFTextStreamEngine probably miscalculates text height

Tilman Hausherr (JIRA) Fri, 01 Jan 2016 02:42:07 -0800

    [ 
https://issues.apache.org/jira/browse/PDFBOX-3175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15076270#comment-15076270
 ]


Tilman Hausherr commented on PDFBOX-3175:
-----------------------------------------

But both PDFTextStreamEngine and PDFTextStripper work with TextPosition 
objects. If we don't divide the height by 2, the changed values would also 
show up in the TextPosition objects of PDFTextStripper, which affects the 
people who extend that class to get the positions. (And as you know, 
PDFTextStreamEngine won't be made public.) The other reason is that the font 
bbox / 2 is not always the "height" that is stored; sometimes it is a 
different value from the font descriptor (see in the source code, at the 
comment "sometimes the bbox has very high values, but CapHeight is OK").
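
To illustrate the second point, here is a minimal sketch of such a height 
heuristic. It is not the PDFBox source: the class and method names are 
hypothetical, the inputs are plain floats in glyph space (1/1000 em) rather 
than PDFBox types, and the exact rule for preferring CapHeight is an 
assumption based on the comment quoted above.

```java
// Sketch only: hypothetical names, not part of the PDFBox API.
final class HeightHeuristicSketch {
    private HeightHeuristicSketch() {
    }

    /**
     * Returns the "height" that would be stored, in glyph-space units.
     * Starts from half the font bounding-box height and, as an assumed rule,
     * falls back to CapHeight when it is set and smaller than that value
     * (or when the bounding box gives no usable height at all).
     */
    static float heuristicHeight(float bboxHeight, float capHeight) {
        float height = bboxHeight / 2f;
        if (capHeight != 0f && (capHeight < height || height == 0f)) {
            height = capHeight;
        }
        return height;
    }

    public static void main(String[] args) {
        // Ordinary font: half the bbox is kept.
        System.out.println(heuristicHeight(1100f, 716f)); // prints 550.0
        // Font with an inflated bbox: CapHeight is used instead.
        System.out.println(heuristicHeight(2800f, 700f)); // prints 700.0
        // No CapHeight in the descriptor: half the bbox is kept.
        System.out.println(heuristicHeight(1100f, 0f));   // prints 550.0
    }
}
```

The second call shows why simply dropping the "/ 2" does not give a 
consistent result: for some fonts the stored value never came from the bbox 
in the first place.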

You can see what I mean by making the change you think about and then running 
the tests, and also compare the output of DrawPrintTextLocations to the 
previous output.

I'd like to set this issue to resolved and end this discussion, which is 
rather something for 2.1. I do get your point that the current structure is 
weird and the API doc is incorrect, but just removing the "/ 2" and making 
one class public isn't the magic solution. You (and Ben McCann) are welcome 
to help with code, ideas or criticism to build a better text extraction in 
the future.

I suspect that the structure should be 3-part, i.e. a "pure" 
PDFTextStreamEngine without any hacks, a new class that does the heuristics to 
get the height and also offers exact heights if needed, and an improved 
PDFTextStripper that fixes the problems that Ben McCann mentioned in several 
issues.

> PDFTextStreamEngine probably miscalculates text height
> ------------------------------------------------------
>
>                 Key: PDFBOX-3175
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3175
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.0
>            Reporter: Leo
>         Attachments: MarketT_140815-1-marked-1-18.png, 
> MarketT_140815-1-marked-1.png, PDFBOX-3175-reduced.pdf, snapshot.png
>
>
> When parsing a PDF document, TextPosition is created with constant text 
> height, about 2 times smaller than character width, regardless of font size.
> The following workaround to calculate dyDisplay fixes the issue:
>         float verticalScaling = 1/1000f;
>         if (font instanceof PDType3Font) {
>             Matrix fontMatrix = font.getFontMatrix();
>             verticalScaling = fontMatrix.getValue(1, 1);
>         }
>         float dyDisplay = bbox.getHeight() * fontSize * verticalScaling;



