[ 
https://issues.apache.org/jira/browse/PDFBOX-2508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14218215#comment-14218215
 ] 

John Hewson edited comment on PDFBOX-2508 at 11/19/14 5:46 PM:
---------------------------------------------------------------

The fs value that PrintTextLocations prints comes from TextPosition#getFontSize(), 
which is the raw font size in the PDF. It's common for this to be scaled down in 
the PDF content, so perhaps you want #getFontSizeInPt() instead?
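
As a rough illustration of why a raw font size can differ from the size that 
is actually rendered, here is a minimal sketch. It is not PDFBox code: the 
class EffectiveFontSize, its method, and all numbers are made up for the 
example, and it only assumes the usual PDF model in which the size operand of 
the Tf operator is multiplied by the scale of the text matrix and of the 
current transformation matrix.

```java
// Illustration only: not PDFBox code. All names and numbers are hypothetical.
final class EffectiveFontSize {

    /**
     * Rendered size of text drawn with the given raw font size (the operand
     * of the Tf operator) after vertical scaling by the text matrix and by
     * the current transformation matrix (CTM).
     */
    static double effectiveSize(double rawFontSize, double textMatrixScaleY, double ctmScaleY) {
        return rawFontSize * textMatrixScaleY * ctmScaleY;
    }

    public static void main(String[] args) {
        // A content stream may select a font at size 1 and enlarge it
        // through the text matrix.
        System.out.println(effectiveSize(1.0, 12.0, 1.0));   // prints 12.0

        // It may equally select a large raw size and shrink it again.
        System.out.println(effectiveSize(64.0, 0.125, 1.0)); // prints 8.0
    }
}
```

Either way, the raw size alone says little about how large the text appears 
on the page.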

At one point I tried to fix some of the TextPosition calculations but it ended 
up breaking PDFTextStripper, which seems to rely on the "broken" values.

If you're interested in extracting each glyph but you don't need TextPosition 
instances, then you might want to try subclassing the new PDFStreamEngine in 
the 2.0 trunk, because it provides accurate callbacks for each glyph.
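
The shape of that approach can be sketched without PDFBox on the classpath. 
In the sketch below, GlyphEngine and its onGlyph hook are hypothetical 
stand-ins for the 2.0 PDFStreamEngine and its per-glyph callback; they are not 
the PDFBox API, and the real class and method signatures should be taken from 
the trunk Javadoc.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a stream engine; this is NOT the PDFBox API.
abstract class GlyphEngine {

    /** Called once per glyph with its code, position, and advance width. */
    protected abstract void onGlyph(int code, double x, double y, double width);

    /** Walks a run of character codes, advancing x by each glyph's width. */
    final void showText(int[] codes, double[] widths, double startX, double y) {
        double x = startX;
        for (int i = 0; i < codes.length; i++) {
            onGlyph(codes[i], x, y, widths[i]);
            x += widths[i];
        }
    }
}

// Subclass that records every glyph the engine reports.
final class GlyphCollector extends GlyphEngine {

    final List<String> glyphs = new ArrayList<>();

    @Override
    protected void onGlyph(int code, double x, double y, double width) {
        glyphs.add(code + "@" + x + "," + y + " w=" + width);
    }

    public static void main(String[] args) {
        GlyphCollector collector = new GlyphCollector();
        // Made-up codes and widths, purely for demonstration.
        collector.showText(new int[] {72, 105}, new double[] {7.0, 3.0}, 70.0, 299.0);
        collector.glyphs.forEach(System.out::println);
    }
}
```

The point of the design is that the subclass sees every glyph individually and 
never has to build or interpret TextPosition instances.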


> Text extraction getting zero font height, bad widths, and ? for text in this 
> PDF with Type 3 Fonts
> --------------------------------------------------------------------------------------------------
>
>                 Key: PDFBOX-2508
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2508
>             Project: PDFBox
>          Issue Type: Bug
>    Affects Versions: 1.8.7
>            Reporter: Fred Andrews
>         Attachments: badtext.pdf, screenshot of acrobat.png
>
>
> The attached file is just line one from a file where every piece of text has 
> these problems. All the other lines were removed with Nitro to make a small 
> test case.
> This is the output from the PrintTextLocations example:
> String[211.92,356.8801 fs=58.0 xscale=58.0 height=1.75392 space=190528.28 
> width=1.7052002]?
> String[129.84,347.04 fs=58.0 xscale=58.0 height=2.72832 space=288435.66 
> width=2.679596]?
> String[70.32,299.28 fs=58.0 xscale=58.0 height=3.31296 space=349985.12 
> width=7.0643997]?
> String[77.3844,299.28 fs=58.0 xscale=58.0 height=3.31296 space=349985.12 
> width=4.8720016]?
> String[82.2564,299.28 fs=58.0 xscale=58.0 height=3.31296 space=349985.12 
> width=6.333603]?
> String[88.590004,299.28 fs=58.0 xscale=58.0 height=3.31296 space=349985.12 
> width=6.577202]?
> String[95.167206,299.28 fs=58.0 xscale=58.0 height=3.31296 space=349985.12 
> width=6.0899963]?
> String[101.2572,299.28 fs=58.0 xscale=58.0 height=3.31296 space=349985.12 
> width=6.333603]?
> String[107.590805,299.28 fs=58.0 xscale=58.0 height=3.31296 space=349985.12 
> width=6.0899963]?
> String[113.6808,299.28 fs=58.0 xscale=58.0 height=3.31296 space=349985.12 
> width=4.8720016]?
> String[118.5528,299.28 fs=58.0 xscale=58.0 height=3.31296 space=349985.12 
> width=3.1668015]?
> String[121.719604,299.28 fs=58.0 xscale=58.0 height=3.31296 space=349985.12 
> width=6.333603]?
> String[128.0532,299.28 fs=58.0 xscale=58.0 height=3.31296 space=349985.12 
> width=6.577194]?
> String[134.63042,299.28 fs=58.0 xscale=58.0 height=3.31296 space=349985.12 
> width=6.0899963]?
> String[140.72041,299.28 fs=58.0 xscale=58.0 height=3.31296 space=349985.12 
> width=3.1667938]?
> String[522.95984,293.28 fs=58.0 xscale=58.0 height=1.36416 space=150394.36 
> width=1.4616089]?
> The font size is far too large (it should be about 8), the value for space 
> is implausibly large, and the height is too small. Each character also comes 
> through as a '?'. The original file shows this on every piece of text.
> In Acrobat everything looks fine, both in the original and in this test case.


