[jira] [Commented] (PDFBOX-3175) PDFTextStreamEngine probably miscalculates text height

Leo (JIRA) Wed, 30 Dec 2015 06:43:18 -0800

    [ 
https://issues.apache.org/jira/browse/PDFBOX-3175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15075072#comment-15075072
 ]


Leo commented on PDFBOX-3175:
-----------------------------

I've read the discussion of PDFBOX-3062, and it is still not clear to me what I 
should do in cases like this file, i.e. when the 2.0 height calculation fails 
but the calculation in the suggested workaround yields the correct value. 
Should we consider such files "bad", so that to process them I need to do the 
calculation on the caller side (ignore TextPosition.getHeight(), obtain the 
font, obtain its bbox, and multiply the bbox height by fontSize and the scaling 
factor)? Or should PDFTextStreamEngine apply some heuristic that detects these 
files and uses the workaround calculation only when the detection is triggered, 
in which case a patch is needed? Either option is fine with me; I can continue 
using my fork of PDFBox with my fixes for the project. But I would like to know 
the "official" way, since it affects how I continue with my project.
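
For illustration, the caller-side calculation mentioned above could be sketched 
roughly as follows. This is only a sketch under stated assumptions: the class 
TextHeightSketch and its method names are hypothetical, and it works on plain 
floats rather than PDFBox objects. In real code the inputs would presumably be 
taken from the TextPosition and its font (the bbox height, the font size and, 
for Type 3 fonts, the vertical entry of the font matrix).

```java
// Hypothetical helper, not part of PDFBox: mirrors the workaround formula
// dyDisplay = bbox.getHeight() * fontSize * verticalScaling.
final class TextHeightSketch {

    // Non-Type 3 fonts are assumed to use 1000 glyph units per text-space unit.
    static final float DEFAULT_VERTICAL_SCALING = 1 / 1000f;

    // For Type 3 fonts the vertical scale is taken from the font matrix
    // (the value at row 1, column 1); otherwise the default 1/1000 is used.
    static float verticalScaling(boolean isType3, float fontMatrixScaleY) {
        return isType3 ? fontMatrixScaleY : DEFAULT_VERTICAL_SCALING;
    }

    // Height in display units: bbox height (glyph units) * font size * scale.
    static float displayHeight(float bboxHeight, float fontSize, float verticalScaling) {
        return bboxHeight * fontSize * verticalScaling;
    }

    public static void main(String[] args) {
        // Made-up example values: a 900-unit-high bbox rendered at 12 pt.
        float scaling = verticalScaling(false, 0f);
        System.out.println(displayHeight(900f, 12f, scaling)); // roughly 10.8
    }
}
```

With these made-up values the height scales with the font size, unlike the 
constant height reported in the issue.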

To explain myself more clearly: I'm not trying to do "text extraction" the way 
it is done by extending PDFTextStripper. In particular, I'm not interested in 
the order in which PDFBox returns the TextPositions, in how it joins them into 
lines, or in any of the "writer" operations PDFTextStripper performs. I'm only 
interested in a stream of raw TextPositions with correct dimension info, so I 
extend PDFTextStreamEngine directly. My algorithm uses its own approach to 
joining TextPositions. Is this the wrong way to do it? For some reason, 
PDFTextStreamEngine is not declared public in the sources.

> PDFTextStreamEngine probably miscalculates text height
> ------------------------------------------------------
>
>                 Key: PDFBOX-3175
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3175
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.0
>            Reporter: Leo
>         Attachments: MarketT_140815-1-marked-1-18.png, 
> MarketT_140815-1-marked-1.png, snapshot.png
>
>
> When parsing a PDF document, TextPosition is created with a constant text 
> height, about two times smaller than the character width, regardless of font size.
> The following workaround to calculate dyDisplay fixes the issue:
>         float verticalScaling = 1/1000f;
>         if (font instanceof PDType3Font) {
>             Matrix fontMatrix = font.getFontMatrix();
>             verticalScaling = fontMatrix.getValue(1, 1);
>         }
>         float dyDisplay = bbox.getHeight() * fontSize * verticalScaling;


