[ 
https://issues.apache.org/jira/browse/PDFBOX-3175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15076653#comment-15076653
 ] 

Leo commented on PDFBOX-3175:
-----------------------------

I understand that this miscalculation is deliberate. Still, these values are 
used not only by PDFTextStripper but also by any class that subclasses it. 
According to the API docs for getHeight(), TextPosition should return some 
height: that of the bounding box or of the largest glyph, it does not matter 
which exactly. What it returns now is 1/2 of the bounding box height, which is 
not close to any interpretation of a "height". Or does PDFTextStripper, having 
received its 1/2 heights, patch the TextPositions with doubled values?
I personally don't have anything against these hotfixes. But imagine this 
situation (in fact, it is the path I walked): someone needs to process a PDF 
file character by character and needs adequate height info. She googles for a 
solution and, since the instructions on the web all talk about extending 
PDFTextStripper for that, follows that route. The result is not OK: the heights 
are wrong, even though the Javadoc for TextPosition confirms her understanding. 
Then she digs all the way from PDFTextStripper up to PDFStreamEngine, only to 
discover that it returns TextPositions that do not comply with the API docs.
For me, now that I've already done all of the above, it's not a problem to 
write a solution, which I did. But if this is not fixed somehow, other people 
will have to repeat this source code digging each time.

And if PDFTextStripper is not intended to be extended at all, it should be 
marked as such, just as is the case with PDFTextStreamEngine now.

> PDFTextStreamEngine probably miscalculates text height
> ------------------------------------------------------
>
>                 Key: PDFBOX-3175
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3175
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.0
>            Reporter: Leo
>         Attachments: MarketT_140815-1-marked-1-18.png, 
> MarketT_140815-1-marked-1.png, PDFBOX-3175-reduced.pdf, snapshot.png
>
>
> When parsing a PDF document, TextPosition is created with constant text 
> height, about 2 times smaller than character width, regardless of font size.
> The following workaround to calculate dyDisplay fixes the issue:
>         float verticalScaling = 1/1000f;
>         if (font instanceof PDType3Font) {
>             Matrix fontMatrix = font.getFontMatrix();
>             verticalScaling = fontMatrix.getValue(1, 1);
>         }
>         float dyDisplay = bbox.getHeight() * fontSize * verticalScaling;
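
To make the arithmetic of the quoted workaround concrete, here is a minimal standalone sketch of the same formula in plain Java. It does not call PDFBox: the class name TextHeightSketch, the method displayHeight, and the sample numbers are hypothetical and chosen only for illustration. It assumes the bounding box height is given in glyph-space units, which for most fonts means 1/1000 of the font size, while a Type 3 font supplies its own vertical scale through its font matrix.

```java
// Hypothetical helper, not part of PDFBox: it only mirrors the arithmetic
// of the workaround quoted in the issue description.
class TextHeightSketch {

    // Assumed default: glyph space is 1/1000 of text space for non-Type 3 fonts.
    static final float DEFAULT_VERTICAL_SCALING = 1 / 1000f;

    // bboxHeight: font bounding box height in glyph-space units
    // fontSize: font size taken from the text state
    // verticalScaling: 1/1000 by default, or the vertical entry of the
    //                  font matrix for a Type 3 font
    static float displayHeight(float bboxHeight, float fontSize, float verticalScaling) {
        return bboxHeight * fontSize * verticalScaling;
    }

    public static void main(String[] args) {
        // Illustrative values only, not taken from the attached PDF.
        float regular = displayHeight(1000f, 12f, DEFAULT_VERTICAL_SCALING);
        // A made-up Type 3 font whose font matrix scales glyph space by 0.01.
        float type3 = displayHeight(100f, 10f, 0.01f);
        System.out.println("regular font height: " + regular);
        System.out.println("Type 3 font height: " + type3);
    }
}
```

With these inputs the computed height follows the font size instead of staying constant, which is the behaviour the workaround is meant to restore.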



