[ 
https://issues.apache.org/jira/browse/PDFBOX-3175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15074623#comment-15074623
 ] 

Tilman Hausherr commented on PDFBOX-3175:
-----------------------------------------

It would be better to explain what you tried to do and how PDFBox failed. Your 
change makes many of the text extraction tests fail, not just one. You have not 
argued why your text extraction is better than the existing one.

The tests are at "PDFBox reactor\pdfbox\src\test\resources\input", the output 
is at "PDFBox reactor\pdfbox\target\test-output".

Regarding correct heights, please run the DrawPrintTextLocations example on 
your file. The red mark is a helper used for text extraction; the blue one is 
the bounding box. Ideally, the red mark should cover small glyphs such as "a", 
"o", and "n". It is not always perfect, but it comes close.

> PDFTextStreamEngine probably miscalculates text height
> ------------------------------------------------------
>
>                 Key: PDFBOX-3175
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3175
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.0
>            Reporter: Leo
>
> When parsing a PDF document, TextPosition is created with a constant text 
> height, about two times smaller than the character width, regardless of font 
> size.
> The following workaround to calculate dyDisplay fixes the issue:
>         float verticalScaling = 1/1000f;
>         if (font instanceof PDType3Font) {
>             Matrix fontMatrix = font.getFontMatrix();
>             verticalScaling = fontMatrix.getValue(1, 1);
>         }
>         float dyDisplay = bbox.getHeight() * fontSize * verticalScaling;
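
For readers following the quoted workaround, here is a minimal sketch of its arithmetic only, written with plain floats so it runs without PDFBox. The class name TextHeightSketch, the method displayHeight, and the sample values (a bounding-box height of 700 glyph units, a Type 3 font matrix entry of 0.01) are hypothetical and chosen for illustration; this is not PDFBox's actual implementation, and it says nothing about whether the change is appropriate for text extraction.

```java
/** Illustration only: the quoted workaround's arithmetic, using plain floats instead of PDFBox types. */
final class TextHeightSketch {

    /** Scale used by the quoted snippet for non-Type 3 fonts (1000 glyph units per text unit). */
    static final float DEFAULT_VERTICAL_SCALING = 1 / 1000f;

    /**
     * Mirrors "bbox.getHeight() * fontSize * verticalScaling" from the quoted snippet.
     * For a Type 3 font the snippet takes verticalScaling from fontMatrix.getValue(1, 1);
     * here the caller passes that value in directly (an assumption of this sketch).
     */
    static float displayHeight(float bboxHeight, float fontSize, float verticalScaling) {
        return bboxHeight * fontSize * verticalScaling;
    }

    public static void main(String[] args) {
        // Hypothetical font bounding box height of 700 glyph units.
        float at12 = displayHeight(700f, 12f, DEFAULT_VERTICAL_SCALING);
        float at24 = displayHeight(700f, 24f, DEFAULT_VERTICAL_SCALING);
        // Hypothetical Type 3 font whose font matrix scales the y axis by 0.01.
        float type3 = displayHeight(70f, 12f, 0.01f);
        System.out.println("12 pt: " + at12 + ", 24 pt: " + at24 + ", Type 3 at 12 pt: " + type3);
    }
}
```

With this formula the computed height grows linearly with the font size, which is the behaviour the reporter expected; whether it matches the red helper mark drawn by DrawPrintTextLocations would still have to be checked on the actual file.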



