[ 
https://issues.apache.org/jira/browse/PDFBOX-3062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14977531#comment-14977531
 ] 

John Hewson edited comment on PDFBOX-3062 at 10/28/15 1:26 AM:
---------------------------------------------------------------

Height isn't calculated in a meaningful way in either 1.8 or 2.0. 
Font#getHeight() shouldn't exist, as there's no such thing as the logical 
height of a glyph. Height in PDFTextStripper is calculated incorrectly by 
PDFTextStreamEngine#showGlyph(). Note that this is deliberate, as legacy 
PDFTextStripper behaviour depends on it. If you want to fix this, then start by 
deleting all of that showGlyph override except for the final part, which does 
the additional Unicode mapping (as we want to keep that). Then fix 
PDFTextStripper.

I suspect that the calculations in TextPosition are also incorrect. It's 
important to realise that we should be dealing with _logical_ width, as returned 
by PDFont#getWidth() (which is in glyph space, so make sure to convert that to 
user space) and _logical_ height, which is simply the current font size (with 
the TM + CTM taken into account). Note that the textRenderingMatrix (TRM) 
passed to onGlyph already has all of these calculations done for you... so use 
that!
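
To make the arithmetic concrete, here is a minimal sketch in plain Java. It is not PDFBox code: the class LogicalGlyphSize and all of its methods are hypothetical helpers written for illustration. It assumes a font whose glyph space is 1/1000 of text space (true for most fonts, but not for Type 3 fonts, which carry their own font matrix), and the glyph width of 333 units is an assumed example value. The matrix values loosely mirror the fs=1.0 and xscale=9.963 figures from the issue description.

```java
public class LogicalGlyphSize {
    // Affine matrices are stored PDF-style as {a, b, c, d, e, f},
    // i.e. the rows [a b 0], [c d 0], [e f 1] applied to row vectors.

    /** Returns m x n, so that m is applied first and n second. */
    public static double[] concat(double[] m, double[] n) {
        return new double[] {
            m[0] * n[0] + m[1] * n[2],
            m[0] * n[1] + m[1] * n[3],
            m[2] * n[0] + m[3] * n[2],
            m[2] * n[1] + m[3] * n[3],
            m[4] * n[0] + m[5] * n[2] + n[4],
            m[4] * n[1] + m[5] * n[3] + n[5]
        };
    }

    /** Builds a text rendering matrix from the text state, TM and CTM. */
    public static double[] textRenderingMatrix(double fontSize, double hScale, double rise,
                                               double[] tm, double[] ctm) {
        double[] textState = {fontSize * hScale, 0, 0, fontSize, 0, rise};
        return concat(concat(textState, tm), ctm);
    }

    /** Logical width in user space; assumes 1000 glyph units per text unit. */
    public static double logicalWidth(double glyphSpaceWidth, double[] trm) {
        // Length of the transformed x unit vector, so rotation is handled too.
        return glyphSpaceWidth / 1000.0 * Math.hypot(trm[0], trm[1]);
    }

    /** Logical height in user space: the font size as scaled by TM and CTM. */
    public static double logicalHeight(double[] trm) {
        // Length of the transformed y unit vector.
        return Math.hypot(trm[2], trm[3]);
    }

    public static void main(String[] args) {
        double[] tm = {9.963, 0, 0, 9.963, 69.488, 386.16};
        double[] ctm = {1, 0, 0, 1, 0, 0}; // identity CTM for simplicity
        double[] trm = textRenderingMatrix(1.0, 1.0, 0.0, tm, ctm);
        System.out.println("logical width  = " + logicalWidth(333, trm));
        System.out.println("logical height = " + logicalHeight(trm));
    }
}
```

Since the TRM handed to the glyph callback already combines the text state, TM and CTM, only the last two helpers would be needed in practice; the matrix construction is shown here just to spell out where the scaling comes from.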

See also my reply to [this 
thread|http://mail-archives.apache.org/mod_mbox/pdfbox-users/201510.mbox/%3ccaaphlv-0z+3ssvpxi8bwvbbqrf-vthkajigwxfedbb3vke_...@mail.gmail.com%3e]
Enjoy!


> Text extraction and height different in 2.0
> -------------------------------------------
>
>                 Key: PDFBOX-3062
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3062
>             Project: PDFBox
>          Issue Type: Sub-task
>          Components: Text extraction
>    Affects Versions: 2.0.0
>            Reporter: Tilman Hausherr
>         Attachments: 005021-reduced.pdf
>
>
> AR:
> {code}
> WITH THE increasing complexity of optical modules,
> {code}
> 1.8:
> {code}
> WITH THE increasing complexity of optical modules,
> String[39.6,399.6 fs=1.0 xscale=29.888 height=20.114626 space=7.472 width=28.214272]W
> String[69.488,386.16 fs=1.0 xscale=9.963 height=6.5955067 space=2.49075 width=3.3176804]I
> String[72.80568,386.16 fs=1.0 xscale=9.963 height=6.5955067 space=2.49075 width=6.0873947]T
> String[78.893074,386.16 fs=1.0 xscale=9.963 height=6.5955067 space=2.49075 width=7.1932907]H
> String[90.71916,386.16 fs=1.0 xscale=9.963 height=6.5955067 space=2.49075 width=6.0873947]T
> String[96.80656,386.16 fs=1.0 xscale=9.963 height=6.5955067 space=2.49075 width=7.1932907]H
> {code}
> 2.0:
> {code}
> W
> ITH THE increasing complexity of optical modules,
> String[39.6,399.6 fs=1.0 xscale=29.888 height=9.584274 space=7.472 width=28.209717]W
> String[69.488,386.16 fs=1.0 xscale=9.963 height=3.194865 space=2.49075 width=3.3177567]I
> String[72.805756,386.16 fs=1.0 xscale=9.963 height=3.194865 space=2.49075 width=6.0858]T
> String[78.891556,386.16 fs=1.0 xscale=9.963 height=3.194865 space=2.49075 width=7.1949615]H
> String[90.719315,386.16 fs=1.0 xscale=9.963 height=3.194865 space=2.49075 width=6.0858]T
> String[96.805115,386.16 fs=1.0 xscale=9.963 height=3.194865 space=2.49075 width=7.1949615]H
> {code}


