Tilman Hausherr commented on PDFBOX-4781:

Wow, that PDF is really broken. Here are the errors from PDF.js:
Warning: Indexing all PDF objects
Warning: Invalid stream: "FormatError: Bad FCHECK in flate stream: 120, 194"
Warning: Invalid stream: "FormatError: Bad FCHECK in flate stream: 72, 195"
Warning: Native JPEG decoding failed -- trying to recover: Error during JPEG image loading
Warning: Unable to decode image: JpegError: JPEG error: SOI not found
I suspect that this file was opened with an ordinary editor, modified, and then 
saved. Object 9 declares a length of 37, but its actual length is 53.
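
For readers unfamiliar with the "Bad FCHECK" warnings above, here is a minimal sketch of the zlib header rule from RFC 1950: the first two bytes (CMF and FLG), read as a 16-bit big-endian number, must be a multiple of 31. It assumes the two numbers in each warning are those two header bytes; the class and method names are hypothetical and not part of PDFBox or PDF.js.

```java
// Sketch of the zlib (RFC 1950) header check. Names are illustrative only.
final class ZlibHeaderCheck {

    /** True if CMF*256 + FLG is a multiple of 31, as RFC 1950 requires. */
    static boolean hasValidFcheck(int cmf, int flg) {
        return (((cmf & 0xFF) << 8) | (flg & 0xFF)) % 31 == 0;
    }

    /** True if the low nibble of CMF selects compression method 8 (deflate). */
    static boolean isDeflate(int cmf) {
        return (cmf & 0x0F) == 8;
    }

    public static void main(String[] args) {
        // 0x78 0x9C is a common header written by zlib at its default level.
        System.out.println(hasValidFcheck(0x78, 0x9C)); // true
        // The byte pairs reported in the warnings above.
        System.out.println(hasValidFcheck(120, 194));   // false
        System.out.println(hasValidFcheck(72, 195));    // false
    }
}
```

Note that 194 and 195 are 0xC2 and 0xC3, the lead bytes UTF-8 uses for code points U+0080 to U+00FF. That would be consistent with a text editor having re-encoded the binary stream data on save, which would also make objects longer than their declared length, though this is only an inference from the numbers.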

There are a lot of other errors, e.g. when opening the images. This is a 
telecom invoice, yet PDF.js and Chrome (I didn't try Adobe) do not show the 
company logo (have you ever seen an invoice without a company logo?). The third 
page is also not shown, although the text indicates there is one.

Changing the Flate filter code so that it returns an empty result would mean 
that incorrect streams would not be detected by preflight, our PDF/A-1b checker.

I'd prefer that you "hack" PDFBox on your own for that application 
(thumbnails), or refuse to create thumbnails for a broken PDF, i.e. create an 
"X" instead, maybe with the text "thumbnail could not be created, the PDF may be 
corrupt or incomplete". The hack would be OK as long as you don't use the jar 
for anything else.
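
The "refuse and show an X" option could look roughly like the sketch below. It is deliberately independent of PDFBox: the renderer is passed in as a `Callable`, and `ThumbnailFallback`, `placeholder` and `renderOrPlaceholder` are hypothetical names chosen for illustration. Only standard `java.awt` classes are used.

```java
import java.awt.BasicStroke;
import java.awt.Color;
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;
import java.util.concurrent.Callable;

// Sketch only: falls back to a placeholder image when rendering fails.
final class ThumbnailFallback {

    /** Draws a red "X" on a white background of the given size. */
    static BufferedImage placeholder(int width, int height) {
        BufferedImage image = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = image.createGraphics();
        try {
            g.setColor(Color.WHITE);
            g.fillRect(0, 0, width, height);
            g.setColor(Color.RED);
            g.setStroke(new BasicStroke(3f));
            g.drawLine(0, 0, width - 1, height - 1); // top-left to bottom-right
            g.drawLine(0, height - 1, width - 1, 0); // bottom-left to top-right
        } finally {
            g.dispose();
        }
        return image;
    }

    /** Runs the renderer; any exception (e.g. an IOException) yields the placeholder. */
    static BufferedImage renderOrPlaceholder(Callable<BufferedImage> renderer,
                                             int width, int height) {
        try {
            return renderer.call();
        } catch (Exception e) {
            // A real application would presumably log e here.
            return placeholder(width, height);
        }
    }
}
```

In the reporter's code, the body of the `try` block would become the `Callable`. The explanatory text could be added to the placeholder with `Graphics2D.drawString`; it is left out here to keep the sketch free of font dependencies.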

> PDF files with invalid compressed streams cannot be rendered
> ------------------------------------------------------------
>                 Key: PDFBOX-4781
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4781
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Parsing
>    Affects Versions: 2.0.18
>            Reporter: Arnaud Jeansen
>            Priority: Major
> I am using pdfbox 2.0.18 to generate thumbnails of PDF files.
> The code is basically as follows
> {code:java}
>     byte[] pdfFile = ...;
>     float dpi = 72f;
>     try (PDDocument pdfDocument = PDDocument.load(new ByteArrayInputStream(pdfFile))) {
>       PDFRenderer pdfRenderer = new PDFRenderer(pdfDocument);
>       return pdfRenderer.renderImageWithDPI(0, dpi, ImageType.RGB);
>     } catch (IOException e) {
>       // Error handling
>     }
> {code}
> This works fine except for a few PDF files with an invalid compressed stream.
> Note: These PDF files open fine with a variety of PDF readers and java 
> libraries. Only pdfbox seems to fail on them.
> For those files, I get an error log "FlateFilter: stop reading corrupt stream 
> due to a DataFormatException" *and* an `IOException` with stacktrace
> {noformat}
> Caused by: java.io.IOException: java.util.zip.DataFormatException: invalid distance too far back
>       at org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:58)
>       at org.apache.pdfbox.filter.Filter.decode(Filter.java:87)
>       at org.apache.pdfbox.cos.COSInputStream.create(COSInputStream.java:84)
>       at org.apache.pdfbox.cos.COSStream.createInputStream(COSStream.java:175)
>       at org.apache.pdfbox.cos.COSStream.createInputStream(COSStream.java:163)
>       at org.apache.pdfbox.pdmodel.PDPage.getContents(PDPage.java:170)
>       at org.apache.pdfbox.pdfparser.PDFStreamParser.<init>(PDFStreamParser.java:92)
>       at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:499)
>       at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:483)
>       at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:156)
>       at org.apache.pdfbox.rendering.PageDrawer.drawPage(PageDrawer.java:269)
>       at org.apache.pdfbox.rendering.PDFRenderer.renderImage(PDFRenderer.java:321)
>       at org.apache.pdfbox.rendering.PDFRenderer.renderImage(PDFRenderer.java:243)
>       at org.apache.pdfbox.rendering.PDFRenderer.renderImageWithDPI(PDFRenderer.java:229)
>       at com.foocompany.service.PdfImageService.convertFromPdfBinaryToJpegBinary(PdfImageService.java:167)
>       ... 68 common frames omitted
> Caused by: java.util.zip.DataFormatException: invalid distance too far back
>       at java.util.zip.Inflater.inflateBytes(Native Method)
>       at java.util.zip.Inflater.inflate(Inflater.java:259)
>       at java.util.zip.Inflater.inflate(Inflater.java:280)
>       at org.apache.pdfbox.filter.FlateFilter.decompress(FlateFilter.java:83)
>       at org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:50)
>       ... 82 common frames omitted
> {noformat}
> Looking further into `org.apache.pdfbox.filter.FlateFilter` :
> * The underlying `DataFormatException` (= broken content that cannot be 
> decompressed when reading the stream) is forwarded up *only* if nothing could 
> be read from this stream
> (see FlateFilter#decompress)
> * The `DataFormatException` is wrapped unconditionally into an `IOException`.
> (see FlateFilter#decode)
> As a hack, swallowing `DataFormatException` in `FlateFilter#decode` makes 
> things work. I get an error log but a thumbnail is correctly generated.
> I am not sure how to proceed from here. I am willing to write a patch but I 
> am not sure how to address this issue.
> I can also provide a PDF file that exhibits the problem privately.
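
To make the trade-off discussed in this thread concrete, here is a rough sketch of a lenient inflate loop built directly on `java.util.zip.Inflater`. It is not the `FlateFilter` source: it keeps whatever was decoded before a `DataFormatException` and reports the corruption through a flag instead of an `IOException`, so a caller such as a validator can still detect the broken stream. `LenientInflate`, `Result` and `deflate` are hypothetical names.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Sketch only: decode as much of a zlib stream as possible and flag corruption.
final class LenientInflate {

    /** Decoded bytes plus a flag telling whether the stream was corrupt or truncated. */
    static final class Result {
        final byte[] data;
        final boolean corrupt;
        Result(byte[] data, boolean corrupt) { this.data = data; this.corrupt = corrupt; }
    }

    static Result inflate(byte[] compressed) {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[64]; // small buffer, so output is collected in small steps
        boolean corrupt = false;
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    corrupt = true; // input ended before the stream did
                    break;
                }
                out.write(buf, 0, n);
            }
        } catch (DataFormatException e) {
            corrupt = true; // keep what was decoded so far, but do not hide the error
        } finally {
            inflater.end();
        }
        return new Result(out.toByteArray(), corrupt);
    }

    /** Helper for experiments: zlib-compresses the input at the default level. */
    static byte[] deflate(byte[] raw) {
        Deflater deflater = new Deflater();
        deflater.setInput(raw);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[256];
        while (!deflater.finished()) {
            out.write(buf, 0, deflater.deflate(buf));
        }
        deflater.end();
        return out.toByteArray();
    }
}
```

With a design like this, a thumbnail generator could ignore the `corrupt` flag while a checker such as preflight could reject the file. How much data is recovered before the exception depends on where the corruption sits in the stream.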

This message was sent by Atlassian Jira
