Franken created PDFBOX-5420:
-------------------------------
Summary: PDFTextStripper does not use cm to infer correct font size
Key: PDFBOX-5420
URL: https://issues.apache.org/jira/browse/PDFBOX-5420
Project: PDFBox
Issue Type: Bug
Reporter: Franken
Attachments: TextStripperTest.kt,
TextStripperUsesTransformationMatrix.java, ec_2.fixed.pdf,
image-2022-04-23-14-46-34-929.png
*Given*
The attached PDF uses the cm operator to scale the current transformation
matrix by a factor of 0.03. The font size is then set to 282 using the Tf
operator.
!image-2022-04-23-14-46-34-929.png|width=389,height=84!
*Error Description*
When PDFTextStripper is used to extract the text from that PDF, the resulting
TextPosition objects report the wrong font size of 282pt. The correct font
size would be 10pt. The miscalculation occurs because PDFTextStripper does not
scale the font size by the current transformation matrix.
*Proposed fix*
The bug can be fixed in the showGlyph method of LegacyPDFStreamEngine.java.
There, fontSizeInPt must be calculated using the following code:
{code:java}
processTextPosition(new TextPosition(pageRotation, pageSize.getWidth(),
        pageSize.getHeight(), translatedTextRenderingMatrix, nextX, nextY,
        Math.abs(dyDisplay), dxDisplay,
        Math.abs(spaceWidthDisplay), unicodeMapping, new int[] { code }, font,
        fontSize,
        // fontSizeInPt: also scale by the horizontal scaling factor of the CTM
        (int) (fontSize * textMatrix.getScalingFactorX()
                * getGraphicsState().getCurrentTransformationMatrix().getScalingFactorX())));{code}
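The arithmetic behind the proposed fix can be sketched without PDFBox. The class and method names below are hypothetical and only illustrate the product of the Tf font size, the text matrix scale, and the cm scale. The text matrix scale of the sample file is not stated in this report, so the sketch assumes 1.0. Under that assumption the result is roughly 8.46 rather than exactly 10, so the exact matrix values in the sample presumably differ slightly from the rounded numbers given here.

```java
// Illustrative sketch only; this is not PDFBox code and all names are hypothetical.
final class EffectiveFontSize {

    /**
     * Horizontal scaling factor of an affine matrix whose first row is (a, b),
     * i.e. the length of the transformed x unit vector.
     */
    static float scalingFactorX(float a, float b) {
        return (float) Math.sqrt(a * a + b * b);
    }

    /** Font size after applying the horizontal scales of the text matrix and the CTM. */
    static float effectiveSize(float fontSize, float textScaleX, float ctmScaleX) {
        return fontSize * textScaleX * ctmScaleX;
    }

    public static void main(String[] args) {
        float ctmScaleX = scalingFactorX(0.03f, 0f); // cm scales by 0.03, no rotation
        float textScaleX = 1f;                        // assumption: not stated in the report
        float size = effectiveSize(282f, textScaleX, ctmScaleX);
        System.out.println("Tf size: 282, effective size: " + size);
    }
}
```

Without the CTM factor, the same product stays at 282, which matches the value currently reported by the TextPosition objects.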
*Further remarks*
To make the error easy to triage, I attached a unit test and a sample file. The
sample was manually edited to remove all unnecessary data and then fixed with
qpdf. However, I redacted only the content stream; other objects in the PDF are
still present, so the PDF is fairly large. As I mainly program in Kotlin, I
attached the original version of the test that I used to debug the issue. A
Java version is attached as well.