[ https://issues.apache.org/jira/browse/PDFBOX-4781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arnaud Jeansen updated PDFBOX-4781:
-----------------------------------
    Description: 
I am using PDFBox 2.0.18 to generate thumbnails of PDF files.

The code is basically as follows:

{code:java}
    byte[] pdfFile = ...;
    float dpi = 72f;
    try (PDDocument pdfDocument = PDDocument.load(new ByteArrayInputStream(pdfFile))) {
      PDFRenderer pdfRenderer = new PDFRenderer(pdfDocument);
      return pdfRenderer.renderImageWithDPI(0, dpi, ImageType.RGB);
    } catch (IOException e) {
      // Error handling
    }
{code}

This works fine except for a few PDF files that contain an invalid compressed
stream.
Note: These PDF files open fine in a variety of PDF readers and Java
libraries. Only PDFBox seems to fail on them.

For those files, I get the error log "FlateFilter: stop reading corrupt stream
due to a DataFormatException" *and* an `IOException` with the following stack trace:

{noformat}
Caused by: java.io.IOException: java.util.zip.DataFormatException: invalid distance too far back
        at org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:58)
        at org.apache.pdfbox.filter.Filter.decode(Filter.java:87)
        at org.apache.pdfbox.cos.COSInputStream.create(COSInputStream.java:84)
        at org.apache.pdfbox.cos.COSStream.createInputStream(COSStream.java:175)
        at org.apache.pdfbox.cos.COSStream.createInputStream(COSStream.java:163)
        at org.apache.pdfbox.pdmodel.PDPage.getContents(PDPage.java:170)
        at org.apache.pdfbox.pdfparser.PDFStreamParser.<init>(PDFStreamParser.java:92)
        at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:499)
        at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:483)
        at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:156)
        at org.apache.pdfbox.rendering.PageDrawer.drawPage(PageDrawer.java:269)
        at org.apache.pdfbox.rendering.PDFRenderer.renderImage(PDFRenderer.java:321)
        at org.apache.pdfbox.rendering.PDFRenderer.renderImage(PDFRenderer.java:243)
        at org.apache.pdfbox.rendering.PDFRenderer.renderImageWithDPI(PDFRenderer.java:229)
        at com.foocompany.service.PdfImageService.convertFromPdfBinaryToJpegBinary(PdfImageService.java:167)
        ... 68 common frames omitted
Caused by: java.util.zip.DataFormatException: invalid distance too far back
        at java.util.zip.Inflater.inflateBytes(Native Method)
        at java.util.zip.Inflater.inflate(Inflater.java:259)
        at java.util.zip.Inflater.inflate(Inflater.java:280)
        at org.apache.pdfbox.filter.FlateFilter.decompress(FlateFilter.java:83)
        at org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:50)
        ... 82 common frames omitted
{noformat}
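
Since the `DataFormatException` only shows up as the cause of an `IOException`, the caller has to walk the cause chain to tell a corrupt compressed stream apart from other I/O failures. The following is a minimal sketch of such a check using only the JDK. The class and method names are hypothetical and are not part of PDFBox.

```java
import java.util.zip.DataFormatException;

// Hypothetical helper, not part of PDFBox.
final class CorruptStreamCheck {

    private CorruptStreamCheck() {
    }

    /** Returns true if the error or any of its causes is a DataFormatException. */
    static boolean causedByCorruptDeflateData(Throwable error) {
        // Bound the walk so that a cyclic cause chain cannot loop forever.
        int depth = 0;
        for (Throwable t = error; t != null && depth < 32; t = t.getCause(), depth++) {
            if (t instanceof DataFormatException) {
                return true;
            }
        }
        return false;
    }
}
```

In the `catch (IOException e)` block of the snippet above, `CorruptStreamCheck.causedByCorruptDeflateData(e)` could be used to log or count these files separately from genuine I/O errors. It does not make the page renderable.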

Looking further into `org.apache.pdfbox.filter.FlateFilter`:
* In `FlateFilter#decompress`, the underlying `DataFormatException` (i.e. broken
content that cannot be decompressed when reading the stream) is rethrown *only*
if nothing could be read from the stream.
* In `FlateFilter#decode`, that `DataFormatException` is wrapped unconditionally
into an `IOException`.

As a hack, swallowing `DataFormatException` in `FlateFilter#decode` makes
things work: I still get the error log, but the thumbnail is generated correctly.
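
To make the idea concrete, here is a standalone sketch of the kind of lenient decoding discussed above, written against `java.util.zip.Inflater` only. It is not the PDFBox code and not a proposed patch. The class name `LenientInflate` is hypothetical, and the sketch assumes zlib-wrapped input that fits in memory. It keeps whatever was decoded before the error and fails only when nothing could be decoded at all.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.DataFormatException;
import java.util.zip.Inflater;

// Hypothetical helper illustrating lenient inflating; not part of PDFBox.
final class LenientInflate {

    private LenientInflate() {
    }

    /**
     * Inflates zlib-wrapped data. If the data turns out to be corrupt after some
     * output has been produced, the partial output is returned. If nothing could
     * be decoded at all, the failure is reported as an IOException.
     */
    static byte[] inflate(byte[] compressed) throws IOException {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[1024];
        try {
            while (!inflater.finished()) {
                int n;
                try {
                    n = inflater.inflate(buffer);
                } catch (DataFormatException e) {
                    if (out.size() > 0) {
                        // Some data was decoded: keep it and stop reading.
                        break;
                    }
                    // Nothing was decoded: there is nothing to salvage.
                    throw new IOException(e);
                }
                if (n == 0) {
                    // No progress: the input is truncated or needs a preset dictionary.
                    break;
                }
                out.write(buffer, 0, n);
            }
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }
}
```

The trade-off is that the caller silently receives a truncated content stream, so a page may render incompletely. Any real fix would probably still want to log the problem, as the hack above does.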

I am not sure how to proceed from here. I am willing to write a patch, but I do
not know what the right fix would be.

I can also provide a PDF file that exhibits the problem privately.


> PDF files with invalid compressed streams cannot be rendered
> ------------------------------------------------------------
>
>                 Key: PDFBOX-4781
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4781
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Parsing
>    Affects Versions: 2.0.18
>            Reporter: Arnaud Jeansen
>            Priority: Major



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
