Tilman Hausherr created PDFBOX-3038:
---------------------------------------
Summary: Text extraction shows glyphs with zero height
Key: PDFBOX-3038
URL: https://issues.apache.org/jira/browse/PDFBOX-3038
Project: PDFBox
Issue Type: Bug
Components: Text extraction
Affects Versions: 2.0.0
Reporter: Tilman Hausherr
Fix For: 2.0.0
This happens with file 001033.pdf:
2.0:
{code}
String[108.0,663.6 fs=6.96 xscale=6.96 height=0.0 space=12.1104 width=3.4800034]1
String[144.0,668.4 fs=9.0 xscale=9.0 height=0.0 space=20.25 width=2.996994]I
String[147.417,668.4 fs=9.0 xscale=9.0 height=0.0 space=20.25 width=4.5]n
String[152.337,668.4 fs=9.0 xscale=9.0 height=0.0 space=20.25 width=2.25]
String[154.88701,668.4 fs=9.0 xscale=9.0 height=0.0 space=20.25 width=2.501999]t
String[157.809,668.4 fs=9.0 xscale=9.0 height=0.0 space=20.25 width=4.5]h
String[162.729,668.4 fs=9.0 xscale=9.0 height=0.0 space=20.25 width=3.9960022]e
String[167.145,668.4 fs=9.0 xscale=9.0 height=0.0 space=20.25 width=2.25]
{code}
1.8:
{code}
String[108.0,663.6 fs=6.96 xscale=6.96 height=4.57272 space=1.74 width=3.4800034]1
String[144.0,668.4 fs=9.0 xscale=9.0 height=5.913 space=2.25 width=2.996994]I
String[147.417,668.4 fs=9.0 xscale=9.0 height=5.913 space=2.25 width=4.5]n
String[152.337,668.4 fs=9.0 xscale=9.0 height=5.913 space=2.25 width=2.25]
String[154.88701,668.4 fs=9.0 xscale=9.0 height=5.913 space=2.25 width=2.501999]t
String[157.809,668.4 fs=9.0 xscale=9.0 height=5.913 space=2.25 width=4.5]h
String[162.729,668.4 fs=9.0 xscale=9.0 height=5.913 space=2.25 width=3.9960022]e
String[167.145,668.4 fs=9.0 xscale=9.0 height=5.913 space=2.25 width=2.25]
{code}
The font has an empty bbox:
{code}
def
/FontBBox {0 0 0 0}
{code}
1.8 had this code to get the height (in PDSimpleFont):
{code}
PDRectangle fontBBox = desc.getFontBoundingBox();
if (fontBBox != null)
{
    retval = fontBBox.getHeight() / 2;
}
if (retval == 0)
{
    retval = desc.getCapHeight();
}
if (retval == 0)
{
    retval = desc.getAscent();
}
if (retval == 0)
{
    retval = desc.getXHeight();
    if (retval > 0)
    {
        retval -= desc.getDescent();
    }
}
{code}
2.0 has only this:
{code}
float glyphHeight = font.getBoundingBox().getHeight() / 2;
{code}
So 2.0 takes the height from the font itself and has no Plan B when the font's
bbox is empty. Getting the BBox from the font descriptor instead gives correct
heights (and a better text extraction).
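
A minimal sketch of the fallback chain from the 1.8 snippet above, written as a standalone helper so it runs without PDFBox. The class name GlyphHeightFallback, the method glyphHeight and its plain float parameters are hypothetical stand-ins for the font descriptor getters. This is not the actual fix. The sample value 657 is chosen only because the 1.8 heights in the output above equal 0.657 times the font size (5.913 / 9.0 and 4.57272 / 6.96).

```java
public final class GlyphHeightFallback {
    private GlyphHeightFallback() {
    }

    /**
     * Hypothetical helper mirroring the 1.8 fallback order:
     * half the bbox height, then cap height, then ascent,
     * then x-height minus descent. A value of 0 means "not available".
     */
    public static float glyphHeight(float bboxHeight, float capHeight,
                                    float ascent, float xHeight, float descent) {
        float height = bboxHeight / 2;
        if (height == 0) {
            height = capHeight;
        }
        if (height == 0) {
            height = ascent;
        }
        if (height == 0) {
            height = xHeight;
            if (height > 0) {
                // descent is normally negative, so this adds its magnitude
                height -= descent;
            }
        }
        return height;
    }

    public static void main(String[] args) {
        // Empty bbox (as in /FontBBox {0 0 0 0}) with an assumed cap height of 657
        System.out.println(glyphHeight(0f, 657f, 0f, 0f, 0f));
    }
}
```

With an empty bbox the first branch yields 0, so the result comes from the next non-zero metric rather than staying at 0 as it does in the 2.0 one-liner.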