[
https://issues.apache.org/jira/browse/PDFBOX-5420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17527199#comment-17527199
]
Michael Klink edited comment on PDFBOX-5420 at 4/24/22 3:49 PM:
----------------------------------------------------------------
This is not a bug but a design decision. If you consider the JavaDoc
documentation of {{TextPosition.getFontSizeInPt}}, you'll see
{code:java}
/**
 * This will get the font size in pt. To get this size we have to multiply the font size from
 * {@link #getFontSize() getFontSize()} with the text matrix (set by the "Tm" operator)
 * horizontal scaling factor and truncate the result to integer. The actual rendering may appear
 * bigger or smaller depending on the current transformation matrix (set by the "cm" operator).
 * To get the size in rendering, use {@link #getXScale() getXScale()}.
 *
 * @return The font size in pt.
 */
public float getFontSizeInPt()
{code}
Thus, the behavior you observed is the documented behavior!
Nonetheless, one might wonder whether this _documented_ behavior is the
_desired_ behavior. So you might consider changing your *bug* issue to an
*improvement* or *wish* issue. Be aware, though, that this effectively would be
an API change which would be unlikely to be included in a 2.x update. But maybe
you're still in time for a 3.0 change.
That said, a proper improvement would have to go further: both the existing
and the proposed code only work (each in its own fashion) if the matrices
involved only scale. As soon as a non-trivial rotation is involved, the value
returned by {{getFontSizeInPt}} can be any number whose absolute value does not
exceed the value one would expect from the respective implementation.
Also, both the existing and the proposed implementation focus on the
_horizontal_ scaling. Wouldn't the _vertical_ extent be more relevant for a
font size value?
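To illustrate the two points above, here is a minimal standalone sketch. It is plain Java and does not use PDFBox's {{Matrix}} class; the class and method names are made up for this example. It assumes a PDF-style matrix with entries a, b, c, d, in which the horizontal unit vector maps to (a, b) and the vertical one to (c, d). For a uniform scale s combined with a rotation, the entry a alone can be anywhere between -s and s, while the lengths of the two transformed unit vectors stay at s.

```java
/** Standalone illustration only; not PDFBox's Matrix class. */
final class FontScaleSketch {

    /** Returns {a, b, c, d} for a uniform scale s combined with a rotation by theta (radians). */
    static double[] rotateAndScale(double s, double theta) {
        return new double[] {
            s * Math.cos(theta), s * Math.sin(theta),
            -s * Math.sin(theta), s * Math.cos(theta)
        };
    }

    /** Length of the image of the horizontal unit vector (1, 0), which is (a, b). */
    static double horizontalScale(double[] m) {
        return Math.hypot(m[0], m[1]);
    }

    /** Length of the image of the vertical unit vector (0, 1), which is (c, d). */
    static double verticalScale(double[] m) {
        return Math.hypot(m[2], m[3]);
    }

    public static void main(String[] args) {
        // Scale by 10, rotate by 60 degrees: the entry a shrinks to roughly 5.
        double[] m = rotateAndScale(10.0, Math.toRadians(60));
        System.out.printf("a entry only:     %.3f%n", m[0]);
        System.out.printf("horizontal scale: %.3f%n", horizontalScale(m));
        System.out.printf("vertical scale:   %.3f%n", verticalScale(m));
    }
}
```

With a non-uniform matrix such as {2, 0, 0, 3} the two lengths differ (2 horizontally, 3 vertically), which is why the choice between horizontal and vertical extent matters for a font size value.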
Furthermore, the page *UserUnit* value is ignored. Since the specification
says _The range of supported values shall be implementation-dependent,_
though, both the original implementation and your fix could claim that only
the value {{1}} is supported... ;)
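As a rough sketch of what honoring *UserUnit* would mean: the PDF specification defines it as the size of a default user space unit in multiples of 1/72 inch (default 1.0), so a size measured in user space units would be multiplied by it to get points. The class and method below are hypothetical helpers for illustration, not part of PDFBox.

```java
/** Hypothetical helper for illustration; not part of PDFBox. */
final class UserUnitSketch {

    /**
     * Converts a size in default user space units to points (1/72 inch),
     * assuming userUnit is the page's UserUnit value (1.0 when absent).
     */
    static double toPoints(double sizeInUserSpace, double userUnit) {
        if (userUnit <= 0) {
            throw new IllegalArgumentException("UserUnit must be positive");
        }
        return sizeInUserSpace * userUnit;
    }

    public static void main(String[] args) {
        // With the default UserUnit the size is unchanged; with 2.0 it doubles.
        System.out.println(toPoints(10.0, 1.0));
        System.out.println(toPoints(10.0, 2.0));
    }
}
```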
> PDFTextStripper does not use cm to infer correct font size
> ----------------------------------------------------------
>
> Key: PDFBOX-5420
> URL: https://issues.apache.org/jira/browse/PDFBOX-5420
> Project: PDFBox
> Issue Type: Bug
> Reporter: Franken
> Priority: Minor
> Attachments: TextStripperTest.kt,
> TextStripperUsesTransformationMatrix.java, ec_2.fixed.pdf,
> image-2022-04-23-14-46-34-929.png
>
>
> *Given*
> Given is a PDF where the cm operator is used to scale the transformation
> matrix by a factor of 0.02834933. The font size is then set to 282 using the
> Tf operator.
> !image-2022-04-23-14-46-34-929.png|width=389,height=84!
>
> *Error Description*
> When PDFTextStripper is used to extract the text from that PDF, the
> internal TextPosition objects carry the wrong font size of 282pt. The
> correct font size would be 10pt. The reason for this miscalculation is
> that PDFTextStripper does not scale the text size by the current
> transformation matrix.
>
> *Proposed fix*
> In the file LegacyPDFStreamEngine.java the bug can be fixed in the showGlyph
> method, where fontSizeInPt would be calculated using the following code:
> {code:java}
> processTextPosition(new TextPosition(pageRotation, pageSize.getWidth(),
>         pageSize.getHeight(), translatedTextRenderingMatrix, nextX, nextY,
>         Math.abs(dyDisplay), dxDisplay,
>         Math.abs(spaceWidthDisplay), unicodeMapping, new int[] { code }, font,
>         fontSize,
>         // also apply the horizontal scaling of the current transformation matrix
>         (int) (fontSize * textMatrix.getScalingFactorX()
>                 * getGraphicsState().getCurrentTransformationMatrix().getScalingFactorX())));
> {code}
> *Further remarks*
> To make the error easy to triage, I attached a unit test and a sample file.
> The sample was manually edited to remove all unnecessary data and repaired
> with qpdf. However, I redacted only the content stream; other objects in the
> PDF are still present, so the file is fairly large. As I mainly program in
> Kotlin, I attached the original version of the test I used to debug the
> issue. A Java version is attached as well.