[ 
https://issues.apache.org/jira/browse/PDFBOX-3062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15029202#comment-15029202
 ] 

John Hewson edited comment on PDFBOX-3062 at 11/26/15 7:18 PM:
---------------------------------------------------------------

Technically, the bbox is always correct - it's the bbox, by definition. If 
that's what's in the PDF, then that's the bbox. It's not supposed to be the 
visual bounds of the glyph, nor is it supposed to be the smallest box that all 
glyphs fit within; it's simply a box large enough for all glyphs to fit in. So 
the problem is not with the bbox; it's that we're trying to treat it as 
something that it's not.

Trying to emulate the visual bounds via a combination of bbox and cap height is 
madness when we have access to the glyph's exact bounds already. Either we're 
using the bbox for something, or we're using the glyph's bounds, but anything 
else is not going to end well.
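
To make the distinction concrete, here is a minimal sketch using only 
java.awt.geom, not PDFBox's own API. The font bbox values, the rectangular 
glyph outline and the names BoundsSketch, FONT_BBOX, sampleGlyph and 
toTextSpace are all made up for illustration; a real font would supply both 
the bbox and the outline.

```java
import java.awt.geom.Path2D;
import java.awt.geom.Rectangle2D;

class BoundsSketch {
    // Hypothetical font-wide bbox in glyph space (1000 units per em).
    // It only has to be large enough to contain every glyph in the font.
    static final Rectangle2D FONT_BBOX =
            new Rectangle2D.Double(-168.0, -218.0, 1168.0, 1116.0);

    // Hypothetical outline of one glyph: a plain rectangle standing on the
    // baseline, used here as a stand-in for a real glyph path.
    static Path2D sampleGlyph() {
        Path2D path = new Path2D.Double();
        path.moveTo(50.0, 0.0);
        path.lineTo(650.0, 0.0);
        path.lineTo(650.0, 662.0);
        path.lineTo(50.0, 662.0);
        path.closePath();
        return path;
    }

    // Scale a glyph-space length (1000 units per em) by the font size.
    static double toTextSpace(double glyphSpaceUnits, double fontSize) {
        return glyphSpaceUnits / 1000.0 * fontSize;
    }

    public static void main(String[] args) {
        double fontSize = 10.0;
        Rectangle2D glyphBounds = sampleGlyph().getBounds2D();

        // Height taken from the font bbox: identical for every glyph.
        System.out.println("bbox height:  "
                + toTextSpace(FONT_BBOX.getHeight(), fontSize));
        // Height taken from this glyph's own outline.
        System.out.println("glyph height: "
                + toTextSpace(glyphBounds.getHeight(), fontSize));
        // The bbox contains the glyph, but says nothing about its extent.
        System.out.println("bbox contains glyph: "
                + FONT_BBOX.contains(glyphBounds));
    }
}
```

The two heights differ because they answer different questions: one describes 
the font as a whole, the other describes a single glyph. Mixing them is where 
the trouble starts.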

So, the real question is philosophical: are we going to use the bbox or the 
glyph's bounds? We didn't have access to the bounds in the past, so this was 
never a question. We need to decide whether to use the glyph's bounds or not, 
though - no more hacks. There's no reason why using the glyph's bounds can't 
be fast.


> Text extraction and height different in 2.0
> -------------------------------------------
>
>                 Key: PDFBOX-3062
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3062
>             Project: PDFBox
>          Issue Type: Sub-task
>          Components: Text extraction
>    Affects Versions: 2.0.0
>            Reporter: Tilman Hausherr
>            Assignee: Tilman Hausherr
>             Fix For: 2.0.0
>
>         Attachments: 005021-reduced.pdf, 
> PDFBOX-3062-H6NIYQXHLPGD3GI6SNIYINRAZBCDHUCB-reduced-marked-1.png, 
> PDFBOX-3062-H6NIYQXHLPGD3GI6SNIYINRAZBCDHUCB-reduced.pdf, 
> PDFBOX-3062-H6NIYQXHLPGD3GI6SNIYINRAZBCDHUCB.pdf, 
> PDFBOX-3062-N2MOQ7YZICIYGTPLQJAWJ4HLN6CCEMHZ-reduced.pdf, garbled text 2.pdf
>
>
> AR:
> {code}
> WITH THE increasing complexity of optical modules,
> {code}
> 1.8:
> {code}
> WITH THE increasing complexity of optical modules,
> String[39.6,399.6 fs=1.0 xscale=29.888 height=20.114626 space=7.472 
> width=28.214272]W
> String[69.488,386.16 fs=1.0 xscale=9.963 height=6.5955067 space=2.49075 
> width=3.3176804]I
> String[72.80568,386.16 fs=1.0 xscale=9.963 height=6.5955067 space=2.49075 
> width=6.0873947]T
> String[78.893074,386.16 fs=1.0 xscale=9.963 height=6.5955067 space=2.49075 
> width=7.1932907]H
> String[90.71916,386.16 fs=1.0 xscale=9.963 height=6.5955067 space=2.49075 
> width=6.0873947]T
> String[96.80656,386.16 fs=1.0 xscale=9.963 height=6.5955067 space=2.49075 
> width=7.1932907]H
> {code}
> 2.0:
> {code}
> W
> ITH THE increasing complexity of optical modules,
> String[39.6,399.6 fs=1.0 xscale=29.888 height=9.584274 space=7.472 
> width=28.209717]W
> String[69.488,386.16 fs=1.0 xscale=9.963 height=3.194865 space=2.49075 
> width=3.3177567]I
> String[72.805756,386.16 fs=1.0 xscale=9.963 height=3.194865 space=2.49075 
> width=6.0858]T
> String[78.891556,386.16 fs=1.0 xscale=9.963 height=3.194865 space=2.49075 
> width=7.1949615]H
> String[90.719315,386.16 fs=1.0 xscale=9.963 height=3.194865 space=2.49075 
> width=6.0858]T
> String[96.805115,386.16 fs=1.0 xscale=9.963 height=3.194865 space=2.49075 
> width=7.1949615]H
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
