[
https://issues.apache.org/jira/browse/PDFBOX-3038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tilman Hausherr updated PDFBOX-3038:
------------------------------------
Attachment: PDFBOX-3038-001033-p2.pdf
> Text extraction shows glyphs with zero height
> ---------------------------------------------
>
> Key: PDFBOX-3038
> URL: https://issues.apache.org/jira/browse/PDFBOX-3038
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 2.0.0
> Reporter: Tilman Hausherr
> Labels: regression
> Fix For: 2.0.0
>
> Attachments: PDFBOX-3038-001033-p2.pdf
>
>
> This happens with file 001033.pdf:
> 2.0:
> {code}
> String[108.0,663.6 fs=6.96 xscale=6.96 height=0.0 space=12.1104 width=3.4800034]1
> String[144.0,668.4 fs=9.0 xscale=9.0 height=0.0 space=20.25 width=2.996994]I
> String[147.417,668.4 fs=9.0 xscale=9.0 height=0.0 space=20.25 width=4.5]n
> String[152.337,668.4 fs=9.0 xscale=9.0 height=0.0 space=20.25 width=2.25]
> String[154.88701,668.4 fs=9.0 xscale=9.0 height=0.0 space=20.25 width=2.501999]t
> String[157.809,668.4 fs=9.0 xscale=9.0 height=0.0 space=20.25 width=4.5]h
> String[162.729,668.4 fs=9.0 xscale=9.0 height=0.0 space=20.25 width=3.9960022]e
> String[167.145,668.4 fs=9.0 xscale=9.0 height=0.0 space=20.25 width=2.25]
> {code}
> 1.8:
> {code}
> String[108.0,663.6 fs=6.96 xscale=6.96 height=4.57272 space=1.74 width=3.4800034]1
> String[144.0,668.4 fs=9.0 xscale=9.0 height=5.913 space=2.25 width=2.996994]I
> String[147.417,668.4 fs=9.0 xscale=9.0 height=5.913 space=2.25 width=4.5]n
> String[152.337,668.4 fs=9.0 xscale=9.0 height=5.913 space=2.25 width=2.25]
> String[154.88701,668.4 fs=9.0 xscale=9.0 height=5.913 space=2.25 width=2.501999]t
> String[157.809,668.4 fs=9.0 xscale=9.0 height=5.913 space=2.25 width=4.5]h
> String[162.729,668.4 fs=9.0 xscale=9.0 height=5.913 space=2.25 width=3.9960022]e
> String[167.145,668.4 fs=9.0 xscale=9.0 height=5.913 space=2.25 width=2.25]
> {code}
> The font has an empty bbox:
> {code}
> def
> /FontBBox {0 0 0 0}
> {code}
> 1.8 had this code to get the height (in PDSimpleFont):
> {code}
> PDRectangle fontBBox = desc.getFontBoundingBox();
> if (fontBBox != null)
> {
>     retval = fontBBox.getHeight() / 2;
> }
> if( retval == 0 )
> {
>     retval = desc.getCapHeight();
> }
> if( retval == 0 )
> {
>     retval = desc.getAscent();
> }
> if( retval == 0 )
> {
>     retval = desc.getXHeight();
>     if (retval > 0)
>     {
>         retval -= desc.getDescent();
>     }
> }
> {code}
> 2.0 has only this:
> {code}
> float glyphHeight = font.getBoundingBox().getHeight() / 2;
> {code}
> So 2.0 takes the height only from the font's own bounding box and has no
> fallback when that box is empty. Taking the bounding box from the font
> descriptor instead gives correct heights (and better text extraction).
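
The fallback order described above can be sketched without any PDFBox dependency. This is only an illustration under stated assumptions: `GlyphHeightFallback`, `DescriptorMetrics` and `glyphHeight` are hypothetical names, the metric values in `main` are made up, and trying the font's own bounding box first before falling back to the descriptor is an assumed combination of the two snippets, not necessarily the change that was committed for this issue.

```java
// Sketch only. DescriptorMetrics is a hypothetical stand-in for the values a
// font descriptor exposes; it is not a PDFBox class.
class GlyphHeightFallback {

    /** Hypothetical holder for font descriptor metrics (glyph space units). */
    static final class DescriptorMetrics {
        final float bboxHeight;
        final float capHeight;
        final float ascent;
        final float xHeight;
        final float descent;

        DescriptorMetrics(float bboxHeight, float capHeight, float ascent,
                          float xHeight, float descent) {
            this.bboxHeight = bboxHeight;
            this.capHeight = capHeight;
            this.ascent = ascent;
            this.xHeight = xHeight;
            this.descent = descent;
        }
    }

    /**
     * Assumed combination: try half the height of the font's own bounding box
     * (what 2.0 does), then walk the 1.8 chain on the descriptor:
     * descriptor bbox, cap height, ascent, x-height minus descent.
     */
    static float glyphHeight(float fontBBoxHeight, DescriptorMetrics desc) {
        float retval = fontBBoxHeight / 2;
        if (retval == 0) {
            retval = desc.bboxHeight / 2;
        }
        if (retval == 0) {
            retval = desc.capHeight;
        }
        if (retval == 0) {
            retval = desc.ascent;
        }
        if (retval == 0) {
            retval = desc.xHeight;
            if (retval > 0) {
                // descent is normally negative, so this adds its magnitude
                retval -= desc.descent;
            }
        }
        return retval;
    }

    public static void main(String[] args) {
        // Illustrative values only: the font bbox is empty ({0 0 0 0}),
        // the descriptor still carries a usable bounding box.
        DescriptorMetrics desc = new DescriptorMetrics(1000f, 700f, 800f, 450f, -200f);
        System.out.println(glyphHeight(0f, desc));
    }
}
```

With an empty font bounding box the first step yields 0, so the result comes from the first non-zero descriptor metric rather than staying at 0 as in the 2.0 output.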