[ https://issues.apache.org/jira/browse/PDFBOX-4781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17065499#comment-17065499 ]
Arnaud Jeansen commented on PDFBOX-4781:
----------------------------------------

[~tilman] Oh yes it is indeed incredibly broken, we see several of those *every day* on our platform, all from invoices from Orange. Based on their metadata, they seem to be doing very funky stuff with itext to generate that. It happens for some of their invoices, not all.

Anyway, thanks for having a look and confirming that "hacking" it on our side is the best option.

> PDF files with invalid compressed streams cannot be rendered
> ------------------------------------------------------------
>
>                 Key: PDFBOX-4781
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4781
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Parsing
>    Affects Versions: 2.0.18
>            Reporter: Arnaud Jeansen
>            Priority: Major
>
> I am using pdfbox 2.0.18 to generate thumbnails of PDF files.
> The code is basically as follows
> {code:java}
> byte[] pdfFile = ...;
> float dpi = 72L;
> try (PDDocument pdfDocument = PDDocument.load(new ByteArrayInputStream(pdfFile))) {
>     PDFRenderer pdfRenderer = new PDFRenderer(pdfDocument);
>     return pdfRenderer.renderImageWithDPI(0, dpi, ImageType.RGB);
> } catch (IOException e) {
>     // Error handling
> }
> {code}
> This works fine but for a few PDF files with an invalid compressed stream.
> Note: These PDF files open fine with a variety of PDF readers and java libraries. Only pdfbox seems to fail on them.
> For those files, I get an error log "FlateFilter: stop reading corrupt stream due to a DataFormatException" *and* an `IOException` with stacktrace
> {noformat}
> Caused by: java.io.IOException: java.util.zip.DataFormatException: invalid distance too far back
>     at org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:58)
>     at org.apache.pdfbox.filter.Filter.decode(Filter.java:87)
>     at org.apache.pdfbox.cos.COSInputStream.create(COSInputStream.java:84)
>     at org.apache.pdfbox.cos.COSStream.createInputStream(COSStream.java:175)
>     at org.apache.pdfbox.cos.COSStream.createInputStream(COSStream.java:163)
>     at org.apache.pdfbox.pdmodel.PDPage.getContents(PDPage.java:170)
>     at org.apache.pdfbox.pdfparser.PDFStreamParser.<init>(PDFStreamParser.java:92)
>     at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:499)
>     at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:483)
>     at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:156)
>     at org.apache.pdfbox.rendering.PageDrawer.drawPage(PageDrawer.java:269)
>     at org.apache.pdfbox.rendering.PDFRenderer.renderImage(PDFRenderer.java:321)
>     at org.apache.pdfbox.rendering.PDFRenderer.renderImage(PDFRenderer.java:243)
>     at org.apache.pdfbox.rendering.PDFRenderer.renderImageWithDPI(PDFRenderer.java:229)
>     at com.foocompany.service.PdfImageService.convertFromPdfBinaryToJpegBinary(PdfImageService.java:167)
>     ... 68 common frames omitted
> Caused by: java.util.zip.DataFormatException: invalid distance too far back
>     at java.util.zip.Inflater.inflateBytes(Native Method)
>     at java.util.zip.Inflater.inflate(Inflater.java:259)
>     at java.util.zip.Inflater.inflate(Inflater.java:280)
>     at org.apache.pdfbox.filter.FlateFilter.decompress(FlateFilter.java:83)
>     at org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:50)
>     ... 82 common frames omitted
> {noformat}
> Looking further into `org.apache.pdfbox.filter.FlateFilter` :
> * The underlying `DataFormatException` (= broken content that cannot be decompressed when reading the stream) is forwarded up *only* if nothing could be read from this stream (see FlateFilter#decompress)
> * The `DataFormatException` is wrapped unconditionally into an `IOException`. (see FlateFilter#decode)
> As a hack, swallowing `DataFormatException` in `FlateFilter#decode` makes things work. I get an error log but a thumbnail is correctly generated.
> I am not sure how to proceed from here. I am willing to write a patch but I am not sure how to address this issue.
> I can also provide a PDF file that exhibits the problem privately.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
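
For readers following the thread, the "keep whatever could be decoded" idea behind the hack can be sketched with the JDK's `java.util.zip.Inflater` alone. This is not PDFBox's `FlateFilter` code and not a proposed patch. The names `LenientInflate`, `Result`, `inflateLeniently` and `deflate` are hypothetical, chosen for illustration only. The sketch inflates as far as it can, and on a `DataFormatException` (or truncated input) it returns the bytes recovered so far with a flag, instead of throwing.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical helper, JDK only. It illustrates best-effort inflation;
// it is NOT how PDFBox's FlateFilter is implemented.
public class LenientInflate {

    /** Bytes recovered from the stream, plus whether the stream was corrupt or truncated. */
    public static final class Result {
        public final byte[] data;
        public final boolean corrupt;

        private Result(byte[] data, boolean corrupt) {
            this.data = data;
            this.corrupt = corrupt;
        }
    }

    /** Inflates zlib-wrapped data, keeping any output produced before a failure. */
    public static Result inflateLeniently(byte[] compressed) {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[2048];
        boolean corrupt = false;
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n == 0) {
                    // No progress: input ran out (or a dictionary is needed).
                    // Unless the stream is complete, treat it as damaged.
                    corrupt = !inflater.finished();
                    break;
                }
                out.write(buf, 0, n);
            }
        } catch (DataFormatException e) {
            // Broken content: stop here but keep what was already decoded.
            corrupt = true;
        } finally {
            inflater.end();
        }
        return new Result(out.toByteArray(), corrupt);
    }

    /** Test helper: zlib-compresses a byte array. */
    public static byte[] deflate(byte[] raw) {
        Deflater deflater = new Deflater();
        deflater.setInput(raw);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[2048];
        while (!deflater.finished()) {
            out.write(buf, 0, deflater.deflate(buf));
        }
        deflater.end();
        return out.toByteArray();
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 200; i++) {
            sb.append("BT /F1 12 Tf (Invoice) Tj ET\n");
        }
        byte[] original = sb.toString().getBytes(StandardCharsets.US_ASCII);
        byte[] compressed = deflate(original);

        // Simulate a damaged stream by cutting the compressed data in half.
        byte[] damaged = Arrays.copyOf(compressed, compressed.length / 2);
        Result r = inflateLeniently(damaged);
        System.out.println("recovered " + r.data.length + " of " + original.length
                + " bytes, corrupt=" + r.corrupt);
    }
}
```

The caller decides what to do with a `corrupt` result: log it and render what is there (as the hack does), or reject the file. Whether partial content is acceptable depends on the use case; for thumbnails it may be, for text extraction it may not.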