[ 
https://issues.apache.org/jira/browse/PDFBOX-3175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15076647#comment-15076647
 ] 

John Hewson commented on PDFBOX-3175:
-------------------------------------

Please note that PDFTextStreamEngine _deliberately_ miscalculates text bounds, 
for legacy reasons of maintaining compatibility with PDFTextStripper. No other 
code should be using this class - we've provided PDFStreamEngine for this 
purpose, which performs the correct calculations. You can see an example of 
creating a custom subclass 
[here|https://github.com/apache/pdfbox/blob/trunk/examples/src/main/java/org/apache/pdfbox/examples/rendering/CustomGraphicsStreamEngine.java].

{quote}
I've read the discussion of PDFBOX-3062, and it is not clear to me what I 
should use in cases like this file, i.e. when the 2.0 way of height calculation 
fails but the correct value can be obtained by the calculation in the 
suggested workaround. Should we consider such files "bad", so that to process 
them I need to make the calculation on the caller side (ignore 
TextPosition.getHeight(), obtain the Font, obtain the bbox from it, multiply its 
height by fontSize and the scaling factor)? Or should some heuristics be applied in 
PDFTextStreamEngine, which detects these files and uses the workaround 
calculation only in cases when the detection is triggered, so a patch is 
needed? For me either of the two options is OK; I can continue using my fork of 
PDFBox with my fixes for the project. But I would like to know the "official" 
way, since it affects how I continue with my project.
{quote}

There's nothing bad about such PDFs - the bbox is simply a box larger than the 
glyphs in the font - there's no requirement that it be the smallest such box. 
PDFBox should not be using this heuristic at all, but as mentioned, we do so 
for legacy compatibility. If you want accurate bounding boxes, exact bounds 
are available by inspecting the vector outlines of the glyphs 
themselves - PDFTextStreamEngine does not do this, but we have [an 
example|https://github.com/apache/pdfbox/blob/trunk/examples/src/main/java/org/apache/pdfbox/examples/rendering/CustomPageDrawer.java]
 which does. Take a look at that. PDFBox includes all the pieces to get exact 
glyph bounds, we just don't use them ourselves yet.
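
To illustrate the difference between the two approaches, here is a minimal, 
self-contained sketch that uses only java.awt.geom, not PDFBox itself. The 
class and method names are hypothetical, and the glyph outline, the bbox 
height of 1200 units and the 12 pt font size are made-up values. It compares 
the height obtained from the font bbox (as in the workaround quoted in the 
issue) with the height of a single glyph's outline after scaling from glyph 
space.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Path2D;
import java.awt.geom.Rectangle2D;

public class GlyphBoundsSketch {

    /** Height derived from the font-wide bounding box (the heuristic). */
    public static double bboxHeight(double fontBBoxHeight, double fontSize,
                                    double verticalScaling) {
        return fontBBoxHeight * fontSize * verticalScaling;
    }

    /** Bounds of one glyph outline after scaling glyph space by the font size. */
    public static Rectangle2D outlineBounds(Path2D glyphOutline, double fontSize,
                                            double scaling) {
        AffineTransform at =
                AffineTransform.getScaleInstance(fontSize * scaling, fontSize * scaling);
        return at.createTransformedShape(glyphOutline).getBounds2D();
    }

    /** Made-up outline: a rectangle 400 units wide and 500 units tall. */
    public static Path2D sampleGlyph() {
        Path2D.Double path = new Path2D.Double();
        path.moveTo(50, 0);
        path.lineTo(450, 0);
        path.lineTo(450, 500);
        path.lineTo(50, 500);
        path.closePath();
        return path;
    }

    public static void main(String[] args) {
        double fontSize = 12;         // assumed font size
        double scaling = 1 / 1000.0;  // 1000 glyph units per em (non-Type 3 fonts)
        double fontBBoxHeight = 1200; // assumed bbox, taller than any single glyph

        double heuristic = bboxHeight(fontBBoxHeight, fontSize, scaling);
        Rectangle2D exact = outlineBounds(sampleGlyph(), fontSize, scaling);

        System.out.println("bbox-based height:    " + heuristic);
        System.out.println("outline-based height: " + exact.getHeight());
    }
}
```

With these values the bbox-based height comes out larger than the 
outline-based one, which is the expected result when the bbox is not the 
smallest box enclosing the glyphs. In PDFBox the outline would come from the 
font rather than being built by hand, as the linked example does.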

> PDFTextStreamEngine probably miscalculates text height
> ------------------------------------------------------
>
>                 Key: PDFBOX-3175
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3175
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.0
>            Reporter: Leo
>         Attachments: MarketT_140815-1-marked-1-18.png, 
> MarketT_140815-1-marked-1.png, PDFBOX-3175-reduced.pdf, snapshot.png
>
>
> When parsing a PDF document, TextPosition is created with a constant text 
> height, about 2 times smaller than the character width, regardless of font size.
> The following workaround to calculate dyDisplay fixes the issue:
>         // Default glyph-space scaling: 1000 units per em.
>         float verticalScaling = 1/1000f;
>         if (font instanceof PDType3Font) {
>             // Type 3 fonts define their own font matrix.
>             Matrix fontMatrix = font.getFontMatrix();
>             verticalScaling = fontMatrix.getValue(1, 1);
>         }
>         float dyDisplay = bbox.getHeight() * fontSize * verticalScaling;



