[ https://issues.apache.org/jira/browse/PDFBOX-3887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16113398#comment-16113398 ]
Tilman Hausherr edited comment on PDFBOX-3887 at 8/4/17 3:07 PM:
-----------------------------------------------------------------

Something is messed up with that file: it starts with a wrong startxref value, followed by some bad stream lengths. PDFBox is able to recover from all that until it hits an object stream, and it doesn't survive that. PDF.js can open the file, but gives up at page 8. A look at the file with Notepad++ finds a lot of weird spaces and CRs.

-No, there is no quick solution.- I tried opening it with Adobe Reader and resaving it, but that produced an error. -I suspect that they could open it because they do parse on demand and we don't, i.e. they didn't hit the bad part immediately.-

We parse the entire file. We're lenient on bad files, but this is not always successful.


was (Author: tilman):
Something is messed up with that file: it starts with a wrong startxref value, followed by some bad stream lengths. PDFBox is able to recover from all that until it hits an object stream, and it doesn't survive that. PDF.js can open the file, but gives up at page 8. A look at the file with Notepad++ finds a lot of weird spaces and CRs.

No, there is no quick solution. I tried opening it with Adobe Reader and resaving it, but that produced an error. I suspect that they could open it because they do parse on demand and we don't, i.e. they didn't hit the bad part immediately.

We parse the entire file. We're lenient on bad files, but this is not always successful.

> Getting a "DataFormatException: invalid distance too far back" exception for the attached file
> ----------------------------------------------------------------------------------------------
>
>                 Key: PDFBOX-3887
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3887
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.7
>         Environment: Windows 10 64-bit, Ubuntu 14.04 64-bit.
>                      java version "1.8.0_141"
>                      Java(TM) SE Runtime Environment (build 1.8.0_141-b15)
>                      Java HotSpot(TM) 64-Bit Server VM (build 25.141-b15, mixed mode)
>            Reporter: Harun Reşit Zafer
>              Labels: extraction, parsing
>         Attachments: non-contract_00025.pdf
>
>
> PdfBox throws the following exception:
> {code:java}
> Caused by: java.io.IOException: java.util.zip.DataFormatException: invalid distance too far back
>       at org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:82)
>       at org.apache.pdfbox.cos.COSInputStream.create(COSInputStream.java:69)
>       at org.apache.pdfbox.cos.COSStream.createInputStream(COSStream.java:162)
>       at org.apache.pdfbox.pdfparser.PDFObjectStreamParser.<init>(PDFObjectStreamParser.java:55)
>       at org.apache.pdfbox.pdfparser.COSParser.parseObjectStream(COSParser.java:847)
>       at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:753)
>       at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:678)
>       at org.apache.pdfbox.pdfparser.COSParser.parseDictObjects(COSParser.java:638)
>       at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:236)
>       at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:271)
>       at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:984)
>       at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:940)
>       at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:888)
>       at com.diligen.parser.pdf.PdfBoxHelper.getDocumentWithLineSegments(PdfBoxHelper.java:131)
>       ... 7 more
> Caused by: java.util.zip.DataFormatException: invalid distance too far back
>       at java.util.zip.Inflater.inflateBytes(Native Method)
>       at java.util.zip.Inflater.inflate(Inflater.java:259)
>       at java.util.zip.Inflater.inflate(Inflater.java:280)
>       at org.apache.pdfbox.filter.FlateFilter.decompress(FlateFilter.java:107)
>       at org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:73)
>       ... 20 more
> {code}
> If there is no quick solution for this bug, is there a workaround? Can I
> somehow catch the exception and take some action?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org
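
On the reporter's question about catching the exception: the trace shows the `DataFormatException` arriving wrapped in an `IOException`, so one possible approach is for the caller to catch `IOException` and inspect the cause chain before deciding to skip the file. The sketch below is not PDFBox code. It uses only `java.util.zip`, with a hypothetical `inflate` method standing in for the parser call, and `CorruptStreamGuard`, `hasCause`, and `tryInflate` are names invented for illustration. It does not repair anything; it only lets a batch job tell "corrupt compressed data" apart from other I/O failures.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.Optional;
import java.util.zip.DataFormatException;
import java.util.zip.Inflater;

// Hypothetical helper for illustration; not part of PDFBox.
final class CorruptStreamGuard {

    // True if t, or any exception in its cause chain, is an instance of type.
    static boolean hasCause(Throwable t, Class<? extends Throwable> type) {
        for (Throwable c = t; c != null; c = c.getCause()) {
            if (type.isInstance(c)) {
                return true;
            }
        }
        return false;
    }

    // Stand-in for the parser call (assumption): inflates zlib data and, as in
    // the trace above, rethrows DataFormatException wrapped in an IOException.
    static byte[] inflate(byte[] compressed) throws IOException {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IOException("truncated or unsupported deflate stream");
                }
                out.write(buf, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IOException(e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    // Returns the decoded bytes, or empty when the failure was caused by
    // corrupt compressed data. Any other IOException is rethrown unchanged.
    static Optional<byte[]> tryInflate(byte[] compressed) throws IOException {
        try {
            return Optional.of(inflate(compressed));
        } catch (IOException e) {
            if (hasCause(e, DataFormatException.class)) {
                return Optional.empty();
            }
            throw e;
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] notZlib = {1, 2, 3, 4}; // deliberately invalid zlib data
        System.out.println(tryInflate(notZlib).isPresent() ? "decoded" : "skipped corrupt input");
    }
}
```

In an application, the same cause-chain check could presumably wrap the `PDDocument.load` call from the trace, with the caller logging the file name and moving on when the result is empty.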