[ 
https://issues.apache.org/jira/browse/PDFBOX-3175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15074813#comment-15074813
 ] 

Leo commented on PDFBOX-3175:
-----------------------------

Sorry for such a brief description of the problem on my part. Here is a more 
detailed explanation:

1. What I'm trying to achieve is to write a text extraction class by extending 
the PDFTextStreamEngine class. To do this, I edited the source to make the 
class public and extended it the way PDFTextStripper does. Perhaps it is not 
intended for such extension, and that is the source of the problem, though I 
don't see why it should be. Unfortunately, using PDFTextStripper is a costly 
option for me.

2. As of trunk, the height is calculated by the following code:

    // 1/2 the bbox is used as the height todo: why?
    float glyphHeight = bbox.getHeight() / 2;
    ...
    // transformPoint from glyph space -> text space
    float height = font.getFontMatrix().transformPoint(0, glyphHeight).y;
    ...
    float dyDisplay = height * textRenderingMatrix.getScalingFactorY();

So I assume that dyDisplay, which is passed as the maxHeight parameter to 
TextPosition, is bbox-based.

3. I use my own visualization tool, similar to the one you've described, to 
draw the boxes; maxHeight is constant at about 1/2 of the width. I will run 
DrawPrintTextLocations as you request and report its result later.

4. The issue may be related to PDFBOX-1001 (a regression?); the file I'm 
dealing with uses the same generator. At least it pointed me to the place in 
the code where I could try a workaround. The PDF file, if it is of any 
interest, may be obtained here: 
http://www.micex.ru/file/151022/MarketT_140815.pdf

5. My workaround is not intended as a patch; it is only a way to get the 
correct heights for the PDF files I'm dealing with. I'm not suggesting it for a 
merge. It is an adaptation of the way the height was calculated in the 1.8 
branch. It is possible that I misunderstand what maxHeight should contain.
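
To make the difference between the two calculations concrete, here is a minimal 
sketch in plain Java that models only the arithmetic of the trunk code from 
item 2 and of the workaround from the issue description. It does not use 
PDFBox: the class name, method names, parameters and sample values are all 
hypothetical, and it assumes that the vertical scale of the text rendering 
matrix equals the font size, which need not hold for a real file.

```java
// Plain-arithmetic model of the two height calculations discussed above.
// No PDFBox types are used; every name and sample value here is illustrative.
public class HeightSketch {

    // Trunk-style calculation: half the bbox height is mapped through the font
    // matrix. For the point (0, glyphHeight) only the vertical scale (d) and
    // the vertical translation (f) of the matrix contribute to y. The result is
    // then multiplied by the vertical scale of the text rendering matrix.
    static float trunkHeight(float bboxHeight, float fontMatrixD,
                             float fontMatrixF, float renderingScaleY) {
        float glyphHeight = bboxHeight / 2;
        float height = glyphHeight * fontMatrixD + fontMatrixF;
        return height * renderingScaleY;
    }

    // Workaround-style calculation: full bbox height times font size times the
    // vertical scaling (1/1000 unless the value is taken from a Type 3 font).
    static float workaroundHeight(float bboxHeight, float fontSize,
                                  float verticalScaling) {
        return bboxHeight * fontSize * verticalScaling;
    }

    public static void main(String[] args) {
        float bboxHeight = 1000f; // hypothetical bbox height in glyph space
        float fontSize = 12f;     // hypothetical font size
        // Assumption: the rendering matrix's vertical scale equals fontSize.
        System.out.println(trunkHeight(bboxHeight, 0.001f, 0f, fontSize));
        System.out.println(workaroundHeight(bboxHeight, fontSize, 1 / 1000f));
    }
}
```

Under these assumptions the two results differ only by the factor of 1/2 
applied to the bbox height, so the sketch does not by itself explain a height 
that stays constant regardless of font size.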

> PDFTextStreamEngine probably miscalculates text height
> ------------------------------------------------------
>
>                 Key: PDFBOX-3175
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3175
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.0
>            Reporter: Leo
>
> When parsing a PDF document, TextPosition is created with a constant text 
> height, about 2 times smaller than the character width, regardless of font 
> size.
> The following workaround to calculate dyDisplay fixes the issue:
>         float verticalScaling = 1/1000f;
>         if (font instanceof PDType3Font) {
>             Matrix fontMatrix = font.getFontMatrix();
>             verticalScaling = fontMatrix.getValue(1, 1);
>         }
>         float dyDisplay = bbox.getHeight() * fontSize * verticalScaling;



