Franken created PDFBOX-5420:
-------------------------------

             Summary: PDFTextStripper does not use cm to infer correct font size
                 Key: PDFBOX-5420
                 URL: https://issues.apache.org/jira/browse/PDFBOX-5420
             Project: PDFBox
          Issue Type: Bug
            Reporter: Franken
         Attachments: TextStripperTest.kt, 
TextStripperUsesTransformationMatrix.java, ec_2.fixed.pdf, 
image-2022-04-23-14-46-34-929.png

*Given*

The PDF uses the cm operator to scale the current transformation matrix by a 
factor of 0.03. The font size is then set to 282 using the Tf operator. 

!image-2022-04-23-14-46-34-929.png|width=389,height=84!

 

*Error Description*

When PDFTextStripper is used to extract the text from that PDF, the resulting 
TextPosition objects report the wrong font size of 282pt. The correct font 
size would be 10pt. The miscalculation occurs because PDFTextStripper does not 
scale the text size by the current transformation matrix. 
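
As a minimal sketch of the arithmetic involved: the effective size is the Tf operand multiplied by the scale of the text matrix and the scale of the current transformation matrix. This is not PDFBox code; the class and method names are hypothetical, and the text matrix scale of 1 is an assumption, so the actual file may yield a different value.

```java
// Hypothetical helper, not part of PDFBox: illustrates the scaling arithmetic only.
final class FontSizeSketch {

    // Effective size = Tf operand scaled by the text matrix and by the CTM.
    static float effectiveFontSize(float tfFontSize, float textMatrixScaleX, float ctmScaleX) {
        return tfFontSize * textMatrixScaleX * ctmScaleX;
    }

    public static void main(String[] args) {
        float tfFontSize = 282f;      // operand of Tf in the sample PDF
        float ctmScaleX = 0.03f;      // scale applied via cm in the sample PDF
        float textMatrixScaleX = 1f;  // assumption: the text matrix does not scale

        // Ignoring the CTM (factor 1) leaves the size at the raw Tf operand.
        System.out.println(effectiveFontSize(tfFontSize, textMatrixScaleX, 1f));
        // Including the CTM factor brings the size down to a plausible text size.
        System.out.println(effectiveFontSize(tfFontSize, textMatrixScaleX, ctmScaleX));
    }
}
```

The only point of the sketch is that omitting the CTM factor leaves the reported size at the raw Tf operand.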

 

 *Proposed fix*

In LegacyPDFStreamEngine.java, the bug can be fixed in the showGlyph method. 
There, fontSizeInPt must be calculated using the following code:
{code:java}
processTextPosition(new TextPosition(pageRotation, pageSize.getWidth(),
        pageSize.getHeight(), translatedTextRenderingMatrix, nextX, nextY,
        Math.abs(dyDisplay), dxDisplay,
        Math.abs(spaceWidthDisplay), unicodeMapping, new int[] { code }, font,
        fontSize,
        // fontSizeInPt: also scale by the CTM so that cm scaling is reflected
        (int)(fontSize * textMatrix.getScalingFactorX()
                * getGraphicsState().getCurrentTransformationMatrix().getScalingFactorX())));{code}
*Further remarks*

To make the error easy to triage, I attached a unit test and a sample file. The 
sample was manually edited to remove all unnecessary data and fixed with qpdf. 
However, I trimmed only the content stream; other objects in the PDF are still 
present, so the file is fairly large. As I mainly program in Kotlin, I 
attached the original Kotlin version of the test I used to debug the issue. A 
Java version is attached as well. 



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
