[ https://issues.apache.org/jira/browse/PDFBOX-2508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14969351#comment-14969351 ]
Tilman Hausherr edited comment on PDFBOX-2508 at 10/22/15 3:59 PM:
-------------------------------------------------------------------
IMHO {{1f /}} in this line in our code is wrong:
{code}
glyphSpaceToTextSpaceFactor = 1f / font.getFontMatrix().getScaleX();
{code}
This line was inserted 10 years ago and was originally meant for all fonts:
http://pdfbox.cvs.sourceforge.net/viewvc/pdfbox/pdfbox/src/org/pdfbox/util/PDFStreamEngine.java?r1=1.22&r2=1.23
The spec has this text:
{quote}
A common practice is to define glyphs in terms of a 1000-unit glyph coordinate
system, in which case the font matrix is \[0.001 0 0 0.001 0 0].
{quote}
With the file from PDFBOX-2794, the font matrix scale is 0.001, so the factor
becomes 1 / 0.001 = 1000. That is multiplied by the space width 277.832, so
the base value is 277832! With a Tf value of 8, the size is now 2222656.
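To make the arithmetic above easier to check, here is a minimal standalone sketch. It does not use PDFBox; the class and method names are hypothetical and exist only for illustration. It assumes that the alternative to {{1f /}} would be to use the font matrix scale directly, which is my reading of the comment, not a confirmed fix.

```java
public class GlyphSpaceFactorSketch {

    // Hypothetical helper mirroring the line under discussion:
    // the factor is the inverse of the font matrix x-scale.
    static double invertedFactor(double fontMatrixScaleX) {
        return 1.0 / fontMatrixScaleX;
    }

    // Hypothetical alternative (an assumption, not the project's fix):
    // use the font matrix x-scale as the factor without inverting it.
    static double directFactor(double fontMatrixScaleX) {
        return fontMatrixScaleX;
    }

    // Space width in glyph units, scaled by the factor and the Tf font size.
    static double scaledSpaceWidth(double spaceWidth, double factor, double fontSize) {
        return spaceWidth * factor * fontSize;
    }

    public static void main(String[] args) {
        double scaleX = 0.001;       // font matrix [0.001 0 0 0.001 0 0]
        double spaceWidth = 277.832; // space width quoted in the comment
        double fontSize = 8;         // Tf value quoted in the comment

        // Roughly 2222656, the oversized value described above.
        System.out.println(scaledSpaceWidth(spaceWidth, invertedFactor(scaleX), fontSize));
        // Roughly 2.22 when the scale is applied without inversion.
        System.out.println(scaledSpaceWidth(spaceWidth, directFactor(scaleX), fontSize));
    }
}
```

The sketch uses double rather than float only to keep the printed values close to the numbers quoted above.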
> Text extraction getting zero font height, bad widths, and ? for text in this
> PDF with Type 3 Fonts
> --------------------------------------------------------------------------------------------------
>
> Key: PDFBOX-2508
> URL: https://issues.apache.org/jira/browse/PDFBOX-2508
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 1.8.7
> Reporter: Fred Andrews
> Labels: type3
> Attachments: badtext.pdf, screenshot of acrobat.png
>
>
> Attached file is just line one from a file where every piece of text has
> these problems. All the other lines were removed with Nitro to make a small
> test case.
> This is the output from the PrintTextLocations example:
> String[211.92,356.8801 fs=58.0 xscale=58.0 height=1.75392 space=190528.28
> width=1.7052002]?
> String[129.84,347.04 fs=58.0 xscale=58.0 height=2.72832 space=288435.66
> width=2.679596]?
> String[70.32,299.28 fs=58.0 xscale=58.0 height=3.31296 space=349985.12
> width=7.0643997]?
> String[77.3844,299.28 fs=58.0 xscale=58.0 height=3.31296 space=349985.12
> width=4.8720016]?
> String[82.2564,299.28 fs=58.0 xscale=58.0 height=3.31296 space=349985.12
> width=6.333603]?
> String[88.590004,299.28 fs=58.0 xscale=58.0 height=3.31296 space=349985.12
> width=6.577202]?
> String[95.167206,299.28 fs=58.0 xscale=58.0 height=3.31296 space=349985.12
> width=6.0899963]?
> String[101.2572,299.28 fs=58.0 xscale=58.0 height=3.31296 space=349985.12
> width=6.333603]?
> String[107.590805,299.28 fs=58.0 xscale=58.0 height=3.31296 space=349985.12
> width=6.0899963]?
> String[113.6808,299.28 fs=58.0 xscale=58.0 height=3.31296 space=349985.12
> width=4.8720016]?
> String[118.5528,299.28 fs=58.0 xscale=58.0 height=3.31296 space=349985.12
> width=3.1668015]?
> String[121.719604,299.28 fs=58.0 xscale=58.0 height=3.31296 space=349985.12
> width=6.333603]?
> String[128.0532,299.28 fs=58.0 xscale=58.0 height=3.31296 space=349985.12
> width=6.577194]?
> String[134.63042,299.28 fs=58.0 xscale=58.0 height=3.31296 space=349985.12
> width=6.0899963]?
> String[140.72041,299.28 fs=58.0 xscale=58.0 height=3.31296 space=349985.12
> width=3.1667938]?
> String[522.95984,293.28 fs=58.0 xscale=58.0 height=1.36416 space=150394.36
> width=1.4616089]?
> The font size is way too big (should be more like 8), the value for space is
> ridiculous, and the height is too small. And each character is coming through
> as a '?'. The original file has this on every piece of text.
> In Acrobat everything looks fine, both in the original and in this test case.