[ 
https://issues.apache.org/jira/browse/PDFBOX-2158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14043911#comment-14043911
 ] 

Tilman Hausherr commented on PDFBOX-2158:
-----------------------------------------

Sorry, I can't do anything about the font problem; I don't know enough about 
text extraction. Maybe somebody else will. The second value in the FontBBox 
array really is -65329.

About the rendering:

I have modified SampledImageReader.getDecodeArray() to avoid the NPE and to use 
the correct part of the decode array if it is a stencil.
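
To illustrate the idea only (this is not the committed code), a minimal sketch 
of a null-safe decode array lookup could look like the following. The class and 
method names are hypothetical, it assumes that "the correct part" for a stencil 
is the first pair of values, and it assumes a device colour space whose default 
range is 0..1 per component.

```java
import java.util.Arrays;

public class DecodeArraySketch {
    /**
     * Hypothetical helper, not the PDFBox API.
     * Returns a usable decode array even if the image dictionary has none.
     */
    static float[] decodeArray(float[] decode, boolean stencil, int numComponents) {
        // A stencil mask has exactly one component, so only one pair is needed.
        int expected = stencil ? 2 : 2 * numComponents;
        if (decode == null || decode.length < expected) {
            // Missing or too short: fall back to the default [0 1 0 1 ...]
            // instead of dereferencing null (assumption: device colour space).
            float[] def = new float[expected];
            for (int i = 0; i < expected; i += 2) {
                def[i] = 0f;
                def[i + 1] = 1f;
            }
            return def;
        }
        // Use only the part that applies (assumed here: the leading entries).
        return Arrays.copyOf(decode, expected);
    }
}
```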

There's another exception in Type1CharString related to a lineTo command 
without preceding moveTo command. I have inserted a fix for this case and for a 
similar one, i.e. just do a moveTo instead and log a warning.
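
The general technique can be sketched with java.awt.geom.Path2D, which throws 
an IllegalPathStateException on a lineTo without an initial moveTo. The class 
and method names below are hypothetical and this is not the Type1CharString 
code itself, just a small illustration of "do a moveTo instead and log a 
warning".

```java
import java.awt.geom.Path2D;
import java.util.logging.Logger;

public class LineToSketch {
    private static final Logger LOG = Logger.getLogger(LineToSketch.class.getName());

    /** Hypothetical helper: a lineTo that tolerates a missing initial moveTo. */
    static void safeLineTo(Path2D path, double x, double y) {
        if (path.getCurrentPoint() == null) {
            // The path is still empty, so lineTo would throw; start the path here.
            LOG.warning("lineTo without initial moveTo, using moveTo instead");
            path.moveTo(x, y);
        } else {
            path.lineTo(x, y);
        }
    }
}
```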

This was done in rev 1605545 for the trunk. Branch 1.8 doesn't have the 
exceptions.

Your file is seriously broken. There are at least 3 errors in it. If the file 
was created by your own employer or by a business partner, please direct them 
to this issue. It was created by "clib pdf library" and "modified by itext". 
I doubt that itext is to blame; Bruno Lowagie has been in the PDF business 
for a long, long time. I assume that itext was used only to modify an existing 
PDF, maybe to insert the barcode or the company logo.

> ExtractText missing most of text in this PDF file, due to font bounding box 
> with minus infinity
> -----------------------------------------------------------------------------------------------
>
>                 Key: PDFBOX-2158
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2158
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.8.5
>         Environment: Windows x64
>            Reporter: Joel Hirsh
>         Attachments: negative.text.box.pdf
>
>
> The attached PDF file is missing most of its text when processed by the 
> ExtractText example program.
> I traced it down to PDFontDescriptorDictionary.getFontBoundingBox() getting a 
> rectangle for COSName.FONT_BBOX that contained a ymin value of minus 
> infinity. That method then creates a PDRectangle which calculates a bounding 
> box with a ymin value of -65,329, and results in an enormous text size, and 
> things go downhill from there.  The text cannot be matched up, and most of it 
> ends up being discarded.
> I was able to hack a fix by doing a check in the constructor 
> PDRectangle.PDRectangle( COSArray array ) for big negative numbers and 
> setting them to 0.  With that change, all the text came through as expected. 
> However, I don't have enough familiarity with the code to understand what a 
> real fix ought to look like.
> The PDF file appears to be fine in other programs such as Acrobat and NitroPDF.
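
For illustration, the workaround described in the quoted report (replacing very 
large negative bounding box values with 0) might be sketched as below. The 
class name, the method name and the threshold are assumptions made for this 
example; the report does not state the cutoff that was used, and this is a 
hack rather than a proper fix.

```java
public class BBoxClampSketch {
    // Hypothetical cutoff; the report only speaks of "big negative numbers".
    static final float LIMIT = -32768f;

    /** Returns a copy of the box with implausible negative values set to 0. */
    static float[] sanitize(float[] bbox) {
        float[] out = bbox.clone();
        for (int i = 0; i < out.length; i++) {
            // Also catches negative infinity; NaN is treated the same way.
            if (Float.isNaN(out[i]) || out[i] < LIMIT) {
                out[i] = 0f;
            }
        }
        return out;
    }
}
```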


