[ https://issues.apache.org/jira/browse/PDFBOX-2023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14064613#comment-14064613 ]

Joel Hirsh commented on PDFBOX-2023:
------------------------------------

I have a similar problem in 1.8.6.

It's on a Type 3 font, and it's not finding a font descriptor. I don't know 
whether the font descriptor is supposed to come from the PDF file or from 
external data, but PDFTextStripper is returning a font height of zero for 
everything in this PDF file.  I have attached a snippet of the PDF file.  When 
running the example program PrintTextLocations, everything looks like this:
String[455.4441,114.86914 fs=20.0 xscale=0.9705882 height=0.0 space=0.9705882 width=5.2411804]P
String[460.68527,114.86914 fs=20.0 xscale=0.9705882 height=0.0 space=0.9705882 width=4.3676453]a
String[465.05292,114.86914 fs=20.0 xscale=0.9705882 height=0.0 space=0.9705882 width=4.3676453]g
String[469.42056,114.86914 fs=20.0 xscale=0.9705882 height=0.0 space=0.9705882 width=3.7853088]e
String[473.20584,114.86914 fs=20.0 xscale=0.9705882 height=0.0 space=0.9705882 width=3.49411] 
String[476.69998,114.86914 fs=20.0 xscale=0.9705882 height=0.0 space=0.9705882 width=3.7853088]2
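
To check a whole dump of this output rather than eyeballing it, something like the following could be used. It is only a sketch in plain Java: the class name ZeroHeightCheck and its two helper methods are made up for illustration and are not part of PDFBox. It assumes each glyph is printed on one line with a "height=" field, as in the output above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of PDFBox: scans PrintTextLocations output lines.
class ZeroHeightCheck {
    // Matches the "height=<number>" field of one output line.
    private static final Pattern HEIGHT = Pattern.compile("height=([0-9.]+)");

    /** Returns the reported height, or NaN if the line has no height field. */
    static float parseHeight(String line) {
        Matcher m = HEIGHT.matcher(line);
        return m.find() ? Float.parseFloat(m.group(1)) : Float.NaN;
    }

    /** Counts the lines whose reported height is exactly zero. */
    static int countZeroHeights(String[] lines) {
        int zero = 0;
        for (String line : lines) {
            if (parseHeight(line) == 0.0f) {
                zero++;
            }
        }
        return zero;
    }

    public static void main(String[] args) {
        // Two of the lines reported above, each joined onto a single line.
        String[] lines = {
            "String[455.4441,114.86914 fs=20.0 xscale=0.9705882 height=0.0 space=0.9705882 width=5.2411804]P",
            "String[460.68527,114.86914 fs=20.0 xscale=0.9705882 height=0.0 space=0.9705882 width=4.3676453]a"
        };
        System.out.println(countZeroHeights(lines) + " of " + lines.length
                + " glyphs report zero height");
    }
}
```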

I can't get a trunk version working to test it there, but it's definitely broken 
in 1.8.6, and I'd consider this a fairly major bug.
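
As a possible starting point for a fallback: the PDF specification requires a Type 3 font dictionary to carry a FontBBox and a FontMatrix, so a height could in principle be estimated from those when no font descriptor is present. The sketch below is plain Java arithmetic, not PDFBox code. The class and method names are hypothetical, the sample values are invented, and it assumes a FontMatrix without rotation or skew. It is not meant to describe how PDFBox does or should compute the height, and a FontBBox of all zeros would still yield zero.

```java
// Hypothetical sketch, not PDFBox API: estimates a Type 3 font height
// from the font dictionary's FontBBox and FontMatrix entries.
class Type3HeightSketch {
    /**
     * @param fontBBox   [llx, lly, urx, ury] in glyph space
     * @param fontMatrix [a, b, c, d, e, f] mapping glyph space to text space
     * @param fontSize   font size from the Tf operator
     * @return estimated height in text-space units scaled by the font size
     */
    static float estimateHeight(float[] fontBBox, float[] fontMatrix, float fontSize) {
        // Height of the bounding box in glyph space.
        float glyphHeight = fontBBox[3] - fontBBox[1];
        // d is the vertical scale factor, assuming no rotation or skew.
        // abs() covers fonts whose matrix flips the y axis.
        return Math.abs(glyphHeight * fontMatrix[3]) * fontSize;
    }

    public static void main(String[] args) {
        // Invented sample values, not taken from the attached PDF.
        float[] bbox = {0f, -200f, 1000f, 800f};
        float[] matrix = {0.001f, 0f, 0f, 0.001f, 0f, 0f};
        System.out.println("estimated height: " + estimateHeight(bbox, matrix, 20f));
    }
}
```

The text matrix and the current transformation matrix would still have to be applied on top of this, which may be part of why a nominal size of 20 can render much smaller on the page.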

> Text extraction gets nothing / zero font height
> -----------------------------------------------
>
>                 Key: PDFBOX-2023
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2023
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.0
>            Reporter: Tilman Hausherr
>         Attachments: PDFBOX-2023.pdf, zero_height.pdf
>
>
> Fred Andrews posted this to the user list and I can confirm that text 
> extraction gets nothing:
> I am using PDFTextStripper on some PDF statements from Bank of America, and 
> everything is coming through as zero height. I traced it down to 
> getFontHeight in org.apache.pdfbox.pdmodel.font.PDSimpleFont, which is indeed 
> getting zero.  The font is a Type 3 font and I'm not sure how it should work, 
> but getFontHeight is calling getAFM() and that is returning null because 
> it's not a Type 1 font.  Then in the next section in getFontHeight there are 
> no font descriptors, and the zero just flows all the way through 
> getFontHeight. 
> I searched for anything I could key on to calculate the font height but 
> couldn't find it.  The font size is claimed to be 20 by getFontSize(), 
> although it appears to be more like 8. I did trace to where it got a font 
> size command of twenty, but somehow I'm assuming that would need to be 
> scaled, and I can't see where that might come from.
> The font width on the other hand looks accurate, and I would think something 
> similar to that would be needed, but would really appreciate some guidance on 
> how it should work.  If I have a clue on how it should work I can see what I 
> can do to implement it.
> This file displays fine in Acrobat and edits fine in Nitro, so it can't be 
> that invalid.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
