[
https://issues.apache.org/jira/browse/PDFBOX-3175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15075072#comment-15075072
]
Leo commented on PDFBOX-3175:
-----------------------------
I've read the discussion of PDFBOX-3062, and it is still not clear to me what I
should do in cases like this file, i.e. when the 2.0 height calculation fails
but the suggested workaround yields the correct value. Should such files be
considered "bad", so that to process them I need to do the calculation on the
caller side (ignore TextPosition.getHeight(), obtain the font, get its bbox,
and multiply the bbox height by fontSize and the scaling factor)? Or should
PDFTextStreamEngine apply some heuristic that detects these files and uses the
workaround calculation only when the detection is triggered, in which case a
patch is needed? Either option is fine for me; I can continue using my fork of
PDFBox with my fixes for the project. But I would like to know the "official"
way, since it affects how I continue with my project.
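To make the first option concrete, here is a minimal sketch of the calculation
described above (ignoring TextPosition.getHeight() and deriving the height from
the font's bbox). It uses plain floats rather than PDFBox types so that it
stands alone. The class and method names (HeightWorkaround, glyphHeight) are
hypothetical, and the inputs are assumed to come from bbox.getHeight(), the
current font size, and, for Type 3 fonts, fontMatrix.getValue(1, 1).

```java
// Hypothetical helper, not part of PDFBox: mirrors the workaround from the
// issue description using plain floats.
final class HeightWorkaround {

    // Assumed default for non-Type 3 fonts: 1000 glyph-space units per
    // text-space unit, as in the workaround.
    static final float DEFAULT_VERTICAL_SCALING = 1 / 1000f;

    /**
     * @param bboxHeight        height of the font bounding box, in glyph space
     * @param fontSize          font size from the text state
     * @param isType3           whether the font is a Type 3 font
     * @param fontMatrixScaleY  vertical entry (1, 1) of the font matrix;
     *                          only used when isType3 is true
     */
    static float glyphHeight(float bboxHeight, float fontSize,
                             boolean isType3, float fontMatrixScaleY) {
        float verticalScaling = isType3 ? fontMatrixScaleY
                                        : DEFAULT_VERTICAL_SCALING;
        return bboxHeight * fontSize * verticalScaling;
    }

    public static void main(String[] args) {
        // Illustrative values only, not taken from the attached file.
        System.out.println(glyphHeight(900f, 12f, false, 0f));
        System.out.println(glyphHeight(9f, 12f, true, 0.1f));
    }
}
```

With this approach the height scales with the font size, whereas the value I
currently get from TextPosition.getHeight() for this file stays constant.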
To explain myself more clearly: I'm not trying to do "text extraction" the way
it is done by extending PDFTextStripper. In particular, I'm not interested in
the order in which PDFBox returns the TextPositions, how it joins them into
lines, or any of the "writer" operations PDFTextStripper performs. I'm only
interested in a stream of raw TextPositions with correct dimension info, so I
extend PDFTextStreamEngine directly. My algorithm uses its own approach to
joining TextPositions. Is this the wrong way to do it? For some reason,
PDFTextStreamEngine is not declared public in the sources.
> PDFTextStreamEngine probably miscalculates text height
> ------------------------------------------------------
>
> Key: PDFBOX-3175
> URL: https://issues.apache.org/jira/browse/PDFBOX-3175
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 2.0.0
> Reporter: Leo
> Attachments: MarketT_140815-1-marked-1-18.png,
> MarketT_140815-1-marked-1.png, snapshot.png
>
>
> When parsing a PDF document, TextPosition is created with a constant text
> height, about 2 times smaller than the character width, regardless of font
> size. The following workaround to calculate dyDisplay fixes the issue:
> float verticalScaling = 1/1000f;
> if (font instanceof PDType3Font) {
>     Matrix fontMatrix = font.getFontMatrix();
>     verticalScaling = fontMatrix.getValue(1, 1);
> }
> float dyDisplay = bbox.getHeight() * fontSize * verticalScaling;