[ https://issues.apache.org/jira/browse/PDFBOX-3062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14977531#comment-14977531 ]
John Hewson edited comment on PDFBOX-3062 at 10/28/15 1:26 AM:
---------------------------------------------------------------
Height isn't calculated in a meaningful way in either 1.8 or 2.0.
Font#getHeight() shouldn't exist, as there's no such thing as the logical
height of a glyph. Height in PDFTextStripper is calculated incorrectly by
PDFTextStreamEngine#showGlyph(). Note that this is deliberate, as legacy
PDFTextStripper behaviour depends on it. If you want to fix this, then start by
deleting all of that showGlyph override except for the final part which does
the additional Unicode mapping (as we want to keep that). Then fix
PDFTextStripper.
I suspect that the calculations in TextPosition are also incorrect. It's
important to realise that we should be dealing with _logical_ width, as returned
by PDFont#getWidth() (which is in glyph space, so make sure to convert that to
user space) and _logical_ height, which is simply the current font size (with
the TM + CTM taken into account). Note that the textRenderingMatrix (TRM)
passed to onGlyph already has all of these calculations done for you... so use
that!
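
To make the arithmetic concrete, here is a rough sketch of what "use the TRM"
could look like. It is not PDFBox code: LogicalGlyphSize, logicalWidth and
logicalHeight are hypothetical names, and java.awt.geom.AffineTransform merely
stands in for the TRM so that the snippet runs without PDFBox on the classpath.
It assumes the usual 1000 glyph units per text space unit (Type 3 fonts can
differ), a glyph width of 333 units picked for illustration, and a TRM that is
a plain scale of 9.963, the xscale value shown in the issue output below.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Point2D;

// Hypothetical helper, not part of PDFBox: derives logical glyph width and
// height from a text rendering matrix that already folds in font size, TM and CTM.
public class LogicalGlyphSize {

    // Assumption: 1000 glyph space units per text space unit (not true for all Type 3 fonts).
    static final double GLYPH_UNITS_PER_TEXT_UNIT = 1000.0;

    // Logical width: the glyph-space advance converted to text space, then
    // pushed through the TRM as a vector (the translation part is ignored).
    public static double logicalWidth(AffineTransform trm, double glyphSpaceWidth) {
        Point2D advance = new Point2D.Double(glyphSpaceWidth / GLYPH_UNITS_PER_TEXT_UNIT, 0);
        Point2D v = trm.deltaTransform(advance, null);
        return Math.hypot(v.getX(), v.getY());
    }

    // Logical height: one text space unit along the y axis pushed through the
    // TRM, i.e. the effective font size after all scaling.
    public static double logicalHeight(AffineTransform trm) {
        Point2D v = trm.deltaTransform(new Point2D.Double(0, 1), null);
        return Math.hypot(v.getX(), v.getY());
    }

    public static void main(String[] args) {
        // Stand-in TRM: uniform scale of 9.963, no rotation or skew.
        AffineTransform trm = AffineTransform.getScaleInstance(9.963, 9.963);
        System.out.println("width  = " + logicalWidth(trm, 333));
        System.out.println("height = " + logicalHeight(trm));
    }
}
```

Taking the length of the transformed vector, rather than reading a single
matrix entry, keeps the result sensible for rotated text as well. With the
values above the width comes out at roughly 3.3177 and the height at 9.963.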
See also my reply to [this
thread|http://mail-archives.apache.org/mod_mbox/pdfbox-users/201510.mbox/%3ccaaphlv-0z+3ssvpxi8bwvbbqrf-vthkajigwxfedbb3vke_...@mail.gmail.com%3e]
Enjoy!
> Text extraction and height different in 2.0
> -------------------------------------------
>
> Key: PDFBOX-3062
> URL: https://issues.apache.org/jira/browse/PDFBOX-3062
> Project: PDFBox
> Issue Type: Sub-task
> Components: Text extraction
> Affects Versions: 2.0.0
> Reporter: Tilman Hausherr
> Attachments: 005021-reduced.pdf
>
>
> AR:
> {code}
> WITH THE increasing complexity of optical modules,
> {code}
> 1.8:
> {code}
> WITH THE increasing complexity of optical modules,
> String[39.6,399.6 fs=1.0 xscale=29.888 height=20.114626 space=7.472 width=28.214272]W
> String[69.488,386.16 fs=1.0 xscale=9.963 height=6.5955067 space=2.49075 width=3.3176804]I
> String[72.80568,386.16 fs=1.0 xscale=9.963 height=6.5955067 space=2.49075 width=6.0873947]T
> String[78.893074,386.16 fs=1.0 xscale=9.963 height=6.5955067 space=2.49075 width=7.1932907]H
> String[90.71916,386.16 fs=1.0 xscale=9.963 height=6.5955067 space=2.49075 width=6.0873947]T
> String[96.80656,386.16 fs=1.0 xscale=9.963 height=6.5955067 space=2.49075 width=7.1932907]H
> {code}
> 2.0:
> {code}
> W
> ITH THE increasing complexity of optical modules,
> String[39.6,399.6 fs=1.0 xscale=29.888 height=9.584274 space=7.472 width=28.209717]W
> String[69.488,386.16 fs=1.0 xscale=9.963 height=3.194865 space=2.49075 width=3.3177567]I
> String[72.805756,386.16 fs=1.0 xscale=9.963 height=3.194865 space=2.49075 width=6.0858]T
> String[78.891556,386.16 fs=1.0 xscale=9.963 height=3.194865 space=2.49075 width=7.1949615]H
> String[90.719315,386.16 fs=1.0 xscale=9.963 height=3.194865 space=2.49075 width=6.0858]T
> String[96.805115,386.16 fs=1.0 xscale=9.963 height=3.194865 space=2.49075 width=7.1949615]H
> {code}