[jira] [Comment Edited] (PDFBOX-2158) ExtractText missing most of text in this PDF file, due to font bounding box with minus infinity

John Hewson (JIRA) Thu, 26 Jun 2014 23:38:18 -0700

    [ 
https://issues.apache.org/jira/browse/PDFBOX-2158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14045610#comment-14045610
 ]


John Hewson edited comment on PDFBOX-2158 at 6/27/14 6:35 AM:
--------------------------------------------------------------

Looking at the font problem: as noted, the FontBBox contains -65329, but I see 
that the Descent is also far too large: 65324. It seems to me that these two 
numbers are two's complements of what were originally signed 16-bit numbers. 
If so, the original values would have been plausible:

{code}
Descent = -(65324 - 65536) = 212
FontBBox = -(-65329 + 65536) = -207
{code}

We could detect this by looking for values with Math.abs(value) > 32767 and 
applying the equations above. Alternatively, we could just use the values from 
the font whenever they are available and only fall back to the FontDescriptor 
if the font file is missing.
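
A minimal sketch of the first option, assuming the bad values always have a 
magnitude between 32768 and 65535. The class and method names here are 
hypothetical and are not part of PDFBox; the correction is simply the 
equations above applied to whichever sign the broken value has.

```java
// Hypothetical helper, not PDFBox API: repairs font metrics that look like
// wrapped 16-bit values, following the equations in this comment.
final class FontMetricRepair {
    private static final float INT16_MAX = 32767f;
    private static final float UINT16_RANGE = 65536f;

    private FontMetricRepair() {
    }

    // Returns the value unchanged unless its magnitude is outside the
    // signed 16-bit range but still below 65536 (an assumption of this sketch).
    static float repair(float value) {
        float magnitude = Math.abs(value);
        if (magnitude <= INT16_MAX || magnitude >= UINT16_RANGE) {
            return value;
        }
        if (value > 0) {
            return -(value - UINT16_RANGE); // e.g. Descent:  65324 -> 212
        }
        return -(value + UINT16_RANGE);     // e.g. FontBBox: -65329 -> -207
    }

    public static void main(String[] args) {
        System.out.println(repair(65324f));  // prints 212.0
        System.out.println(repair(-65329f)); // prints -207.0
        System.out.println(repair(-212f));   // prints -212.0 (already plausible)
    }
}
```

Whether a check like this belongs in the FontDescriptor accessors or in 
PDRectangle is left open here; the second option avoids the heuristic 
entirely when the font file is embedded.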


was (Author: jahewson):
Looking at the font problem: as noted, the FontBBox contains -65329, but I see 
that the Descent is also far too large: 65324. It seems to me that these two 
numbers are two's complements of what were originally signed 16-bit numbers. 
If so, the original values would have been plausible:

{code}
Descent = -(65324 - 65536) = 212
FontBBox = -(-65329 + 65536) = -207
{code}

We could detect this by looking for values with Math.abs(value) > 32767 and 
applying the equations above. Alternatively, we could just use the values from 
the font whenever they are available and only fall back to the FontDescriptor 
if the font is missing.

> ExtractText missing most of text in this PDF file, due to font bounding box 
> with minus infinity
> -----------------------------------------------------------------------------------------------
>
>                 Key: PDFBOX-2158
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2158
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.8.5
>         Environment: Windows x64
>            Reporter: Joel Hirsh
>         Attachments: negative.text.box.pdf
>
>
> Attached PDF file is missing most of the text when processed by the 
> ExtractText example program.
> I traced it down to PDFontDescriptorDictionary.getFontBoundingBox() getting a 
> rectangle for COSName.FONT_BBOX that contained a ymin value of minus 
> infinity. That method then creates a PDRectangle which calculates a bounding 
> box with a ymin value of -65,329, which results in an enormous text size, and 
> things go downhill from there. The text cannot be matched up, and most of it 
> ends up being discarded.
> I was able to hack a fix by doing a check in the constructor 
> PDRectangle.PDRectangle( COSArray array ) for big negative numbers and 
> setting them to 0.  With that change, all the text came through as expected. 
> However, I don't have enough familiarity with the code to understand what a 
> real fix ought to look like.
> The PDF file looks fine in other programs such as Acrobat and NitroPDF.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
