[ https://issues.apache.org/jira/browse/PDFBOX-2976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andreas Lehmkühler updated PDFBOX-2976:
---------------------------------------
Attachment: PDFBOX2976_FlateFilter.patch
The garbage at the beginning is not a problem, as the parser's self-repair
mechanism simply ignores it.
However, the content streams are all broken, and the flate filter throws an
exception that causes the parser to stop. I guess that behaviour is new;
former versions probably swallowed that exception.
I've attached a patch which swallows the DataFormatException and reduces one of
the buffer sizes so that as much of the data as possible is decompressed.
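
To illustrate the idea (this is not the attached patch, only a minimal sketch
using plain java.util.zip), the following shows how a decoder might swallow the
DataFormatException and keep whatever was inflated before the error. The class
name LenientInflate, both method names and the chunkSize parameter are
hypothetical and do not exist in PDFBox.

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical helper for illustration only; not part of PDFBox.
class LenientInflate {

    // Inflates as much of the input as possible. A DataFormatException
    // (for example "incorrect data check") ends decoding instead of
    // propagating, and the bytes collected so far are returned.
    static byte[] inflateLeniently(byte[] compressed, int chunkSize) {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // The call that throws does not report how many bytes it produced,
        // so that chunk is dropped here. A smaller chunk loses less data.
        byte[] chunk = new byte[chunkSize];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(chunk);
                if (n == 0) {
                    break; // truncated input or preset dictionary required
                }
                out.write(chunk, 0, n);
            }
        } catch (DataFormatException e) {
            // Swallowed on purpose: keep the partial result.
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    // Small zlib compressor, used only to produce sample input.
    static byte[] deflate(byte[] data) {
        Deflater deflater = new Deflater();
        deflater.setInput(data);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[256];
        while (!deflater.finished()) {
            int n = deflater.deflate(buf);
            out.write(buf, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }
}
```

With a stream whose trailing checksum is damaged, inflateLeniently(data, 64)
returns a prefix of the original content instead of failing; a larger chunk
size is faster but may discard more data when the exception occurs.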
> java.util.zip.DataFormatException: incorrect data check
> -------------------------------------------------------
>
> Key: PDFBOX-2976
> URL: https://issues.apache.org/jira/browse/PDFBOX-2976
> Project: PDFBox
> Issue Type: Bug
> Components: Parsing
> Affects Versions: 2.0.0
> Environment: Linux Mint 17.2 x64, JDK7u79, Glassfish 3.1.2.2
> Reporter: Felix Rudolphi
> Attachments: PDFBOX2976_FlateFilter.patch, sc-356376(1)-x.pdf,
> sc-356376(1).pdf, sc-356376-x.pdf, sc-356376.pdf
>
> Original Estimate: 3h
> Remaining Estimate: 3h
>
> When trying to open certain PDF files (examples attached, also any MSDS
> available at http://www.scbt.com/datasheet-356376.html ), an exception is
> thrown, resulting in the file not being parsed:
> java.io.IOException: java.util.zip.DataFormatException: incorrect data check
> at org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:83)
> at org.apache.pdfbox.cos.COSInputStream.create(COSInputStream.java:78)
> at org.apache.pdfbox.cos.COSStream.createInputStream(COSStream.java:160)
> at org.apache.pdfbox.cos.COSStream.getUnfilteredStream(COSStream.java:143)
> at org.apache.pdfbox.pdmodel.PDPage.getContents(PDPage.java:148)
> at org.apache.pdfbox.pdfparser.PDFStreamParser.<init>(PDFStreamParser.java:92)
> at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:450)
> at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:437)
> at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:148)
> at org.apache.pdfbox.text.PDFTextStreamEngine.processPage(PDFTextStreamEngine.java:117)
> at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:367)
> at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:303)
> at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:248)
> at org.apache.pdfbox.text.PDFTextStripper.getText(PDFTextStripper.java:209)
> -- or --
> java.io.IOException: java.util.zip.DataFormatException: incorrect data check
> at org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:83)
> at org.apache.pdfbox.cos.COSInputStream.create(COSInputStream.java:78)
> at org.apache.pdfbox.cos.COSStream.createInputStream(COSStream.java:160)
> at org.apache.pdfbox.cos.COSStream.getUnfilteredStream(COSStream.java:143)
> at org.apache.pdfbox.pdmodel.PDPage.getContents(PDPage.java:148)
> at org.apache.pdfbox.pdfparser.PDFStreamParser.<init>(PDFStreamParser.java:92)
> at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:450)
> at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:437)
> at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:148)
> at org.apache.pdfbox.rendering.PageDrawer.drawPage(PageDrawer.java:179)
> at org.apache.pdfbox.rendering.PDFRenderer.renderPage(PDFRenderer.java:205)
> at org.apache.pdfbox.rendering.PDFRenderer.renderImage(PDFRenderer.java:136)