[ 
https://issues.apache.org/jira/browse/PDFBOX-5420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17527199#comment-17527199
 ] 

Michael Klink edited comment on PDFBOX-5420 at 4/24/22 3:49 PM:
----------------------------------------------------------------

This is not a bug but a design decision. If you look at the Javadoc of 
{{TextPosition.getFontSizeInPt}}, you'll see:

{code:java}
    /**
     * This will get the font size in pt. To get this size we have to multiply the font size from
     * {@link #getFontSize() getFontSize()} with the text matrix (set by the "Tm" operator)
     * horizontal scaling factor and truncate the result to integer. The actual rendering may appear
     * bigger or smaller depending on the current transformation matrix (set by the "cm" operator).
     * To get the size in rendering, use {@link #getXScale() getXScale()}.
     *
     * @return The font size in pt.
     */
    public float getFontSizeInPt()
{code}

Thus, the behavior you observed is the documented behavior!

Nonetheless, one might wonder whether this _documented_ behavior is the 
_desired_ behavior. So you might consider changing your *bug* issue to an 
*improvement* or *wish* issue. Be aware, though, that this effectively would be 
an API change which would be unlikely to be included in a 2.x update. But maybe 
you're still in time for a 3.0 change.

That said, a proper improvement would have to look different: both the 
existing and the proposed code only work (each in its own fashion) if the 
matrices involved only scale. As soon as a non-trivial rotation is involved, 
the value returned by {{getFontSizeInPt}} can be any number whose absolute 
value does not exceed the value expected for the respective implementation.
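
To make the rotation point concrete, here is a minimal sketch in plain Java 
(no PDFBox classes involved). It assumes, purely for illustration, that an 
implementation takes the first matrix element as the horizontal scaling 
factor; the class and method names are hypothetical. For a uniform scale s 
rotated by an angle theta, that element is s * cos(theta), so the derived 
size can land anywhere between -s and s times the nominal font size.

```java
// Hypothetical illustration, not PDFBox code: shows what happens when the
// first matrix element is used as "the" horizontal scaling factor.
final class RotationSketch {

    /** First element (a) of a matrix that scales uniformly by s and rotates by thetaDegrees. */
    static double firstElement(double s, double thetaDegrees) {
        return s * Math.cos(Math.toRadians(thetaDegrees));
    }

    /** Font size derived from the first element only, truncated to an integer. */
    static int naiveFontSize(double fontSize, double s, double thetaDegrees) {
        return (int) (fontSize * firstElement(s, thetaDegrees));
    }

    public static void main(String[] args) {
        // Same font size and same scale; only the rotation changes.
        for (double theta : new double[] { 0, 60, 90, 180 }) {
            System.out.println(theta + " degrees -> " + naiveFontSize(12, 2, theta));
        }
    }
}
```

The glyphs have the same size in all four cases; only the reported number 
changes with the angle.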

Also, both the existing and the proposed implementation focus on the 
_horizontal_ scaling. Wouldn't the _vertical_ extent be more relevant for a 
font size value?
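
A rotation-tolerant, vertical measure could be sketched as follows. This is 
plain Java with hypothetical names and is not how PDFBox computes anything: 
it multiplies the linear parts of the text matrix and the current 
transformation matrix and takes the length of the image of the vertical unit 
vector, i.e. sqrt(c^2 + d^2) of the product. Further text state parameters 
such as text rise or horizontal scaling are left out of this sketch.

```java
// Hypothetical illustration, not PDFBox code. A matrix is represented only by
// its linear part {a, b, c, d}; the translation (e, f) does not affect sizes.
final class VerticalExtentSketch {

    /** Concatenates two linear parts in PDF row-vector order: first m, then n. */
    static double[] concat(double[] m, double[] n) {
        return new double[] {
            m[0] * n[0] + m[1] * n[2], m[0] * n[1] + m[1] * n[3],
            m[2] * n[0] + m[3] * n[2], m[2] * n[1] + m[3] * n[3]
        };
    }

    /** Length of the image of the vertical unit vector (0, 1). */
    static double verticalScale(double[] m) {
        return Math.hypot(m[2], m[3]);
    }

    /** Linear part of a uniform scale by s combined with a rotation by thetaDegrees. */
    static double[] scaleAndRotate(double s, double thetaDegrees) {
        double t = Math.toRadians(thetaDegrees);
        return new double[] {
            s * Math.cos(t), s * Math.sin(t),
            -s * Math.sin(t), s * Math.cos(t)
        };
    }

    /** Font size multiplied by the vertical scale of (text matrix x CTM). */
    static double effectiveFontSize(double fontSize, double[] textMatrix, double[] ctm) {
        return fontSize * verticalScale(concat(textMatrix, ctm));
    }

    public static void main(String[] args) {
        double[] identity = { 1, 0, 0, 1 };
        // The same scale, upright and rotated by 90 degrees.
        System.out.println(effectiveFontSize(20, identity, scaleAndRotate(0.5, 0)));
        System.out.println(effectiveFontSize(20, identity, scaleAndRotate(0.5, 90)));
    }
}
```

Because the length of a vector does not change under rotation, both calls in 
{{main}} yield the same value up to floating-point rounding.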

Furthermore, the page *UserUnit* value is ignored. Since the specification 
says that _the range of supported values shall be implementation-dependent,_ 
though, both the original implementation and your fix could claim that only 
the value {{1}} is supported... ;)
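
For completeness, honoring *UserUnit* would amount to one more 
multiplication, since the entry gives the size of a default user space unit 
in multiples of 1/72 inch (default {{1.0}}). A minimal sketch with 
hypothetical names, assuming the value has already been read from the page 
dictionary:

```java
// Hypothetical illustration, not PDFBox code.
final class UserUnitSketch {

    /** Converts a length in default user space units to points (1/72 inch). */
    static double toPoints(double lengthInUserSpace, double userUnit) {
        if (userUnit <= 0) {
            throw new IllegalArgumentException("UserUnit must be a positive number");
        }
        return lengthInUserSpace * userUnit;
    }

    public static void main(String[] args) {
        System.out.println(toPoints(10, 1.0)); // default UserUnit
        System.out.println(toPoints(10, 2.5)); // each user space unit is 2.5/72 inch
    }
}
```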



> PDFTextStripper does not use cm to infer correct font size
> ----------------------------------------------------------
>
>                 Key: PDFBOX-5420
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5420
>             Project: PDFBox
>          Issue Type: Bug
>            Reporter: Franken
>            Priority: Minor
>         Attachments: TextStripperTest.kt, 
> TextStripperUsesTransformationMatrix.java, ec_2.fixed.pdf, 
> image-2022-04-23-14-46-34-929.png
>
>
> *Given*
> A PDF uses the cm operator to scale the current transformation matrix by a 
> factor of 0.02834933. The font size is then set to 282 using the Tf 
> operator. 
> !image-2022-04-23-14-46-34-929.png|width=389,height=84!
>  
> *Error Description*
> When PDFTextStripper is used to extract the text from that PDF, the 
> resulting TextPosition objects report the wrong font size of 282pt. The 
> correct font size would be 10pt. The reason for this miscalculation is that 
> PDFTextStripper does not scale the text size by the current transformation 
> matrix. 
>  
>  *Proposed fix*
> The bug can be fixed in the showGlyph method of LegacyPDFStreamEngine.java, 
> where fontSizeInPt must be calculated using the following code:
> {code:java}
> processTextPosition(new TextPosition(pageRotation, pageSize.getWidth(),
>         pageSize.getHeight(), translatedTextRenderingMatrix, nextX, nextY,
>         Math.abs(dyDisplay), dxDisplay,
>         Math.abs(spaceWidthDisplay), unicodeMapping, new int[] { code }, font,
>         fontSize,
>         (int) (fontSize * textMatrix.getScalingFactorX()
>                 * getGraphicsState().getCurrentTransformationMatrix().getScalingFactorX())));
> {code}
> *Further remarks*
> To make the error easy to triage, I attached a unit test and a sample file. 
> The sample was manually edited to remove all unnecessary data and fixed with 
> qpdf. However, I redacted only the content stream; other objects in the PDF 
> are still present, so the PDF is pretty large. As I mainly program in 
> Kotlin, I attached the original version of the test I used to debug the 
> issue. There is also a Java version attached. 


