[ 
https://issues.apache.org/jira/browse/PDFBOX-3887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16113398#comment-16113398
 ] 

Tilman Hausherr edited comment on PDFBOX-3887 at 8/4/17 3:07 PM:
-----------------------------------------------------------------

Something is messed up with that file... it starts with a wrong startxref
value, then some bad stream lengths... PDFBox is able to recover from all
that, until it hits an object stream and doesn't survive that. PDF.js can
open the file, but gives up at page 8. A look at the file with Notepad++
finds a lot of weird spaces and CRs. -No, there is no quick solution.- I
tried opening it with Adobe Reader and resaving it, but that produced an
error. -I suspect that they could open it because they do parse on demand
and we don't, i.e. they didn't hit the bad part immediately.- We parse the
entire file. We're lenient on bad files, but this is not always successful.
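
To illustrate what "lenient" can mean for a damaged Flate stream, here is a minimal sketch using only java.util.zip. It is not PDFBox's implementation; the class LenientInflate and its methods are hypothetical names made up for this example. The idea is simply to keep whatever bytes were decoded before the Inflater gives up, instead of propagating the DataFormatException.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

/** Hypothetical helper for illustration; not part of PDFBox. */
final class LenientInflate {

    /** Inflates zlib data and returns whatever was decoded before any failure. */
    static byte[] inflateLeniently(byte[] compressed) {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    break; // truncated input, or a preset dictionary we do not have
                }
                out.write(buf, 0, n);
            }
        } catch (DataFormatException e) {
            // Corrupt data: stop here and keep the bytes decoded so far.
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    /** Compresses data with default zlib settings (used only to build test input). */
    static byte[] deflate(byte[] data) {
        Deflater deflater = new Deflater();
        deflater.setInput(data);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!deflater.finished()) {
            out.write(buf, 0, deflater.deflate(buf));
        }
        deflater.end();
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] original = "stream data stream data stream data"
                .getBytes(StandardCharsets.UTF_8);
        byte[] good = deflate(original);
        byte[] truncated = Arrays.copyOf(good, good.length / 2);

        // An intact stream round-trips exactly.
        System.out.println(Arrays.equals(original, inflateLeniently(good))); // true
        // A damaged stream yields a partial result instead of an exception.
        System.out.println(inflateLeniently(truncated).length
                + " of " + original.length + " bytes recovered");
    }
}
```

Partial output of this kind is only useful when the consumer can cope with incomplete data. For an object stream, a truncated result may still leave the parser without the objects it needs, which is one reason leniency does not always succeed.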


was (Author: tilman):
Something is messed up with that file... it starts with a wrong startxref
value, then some bad stream lengths... PDFBox is able to recover from all
that, until it hits an object stream and doesn't survive that. PDF.js can
open the file, but gives up at page 8. A look at the file with Notepad++
finds a lot of weird spaces and CRs. No, there is no quick solution. I
tried opening it with Adobe Reader and resaving it, but that produced an
error. I suspect that they could open it because they do parse on demand
and we don't, i.e. they didn't hit the bad part immediately. We parse the
entire file. We're lenient on bad files, but this is not always successful.

> Getting a "DataFormatException: invalid distance too far back" exception for 
> the attached file
> ----------------------------------------------------------------------------------------------
>
>                 Key: PDFBOX-3887
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3887
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.7
>         Environment: Windows 10 64-bit, Ubuntu 14.04 64-bit. 
> java version "1.8.0_141" 
> Java(TM) SE Runtime Environment (build 1.8.0_141-b15) 
> Java HotSpot(TM) 64-Bit Server VM (build 25.141-b15, mixed mode)
>            Reporter: Harun Reşit Zafer
>              Labels: extraction, parsing
>         Attachments: non-contract_00025.pdf
>
>
> PDFBox throws the following exception:
> {code:java}
> Caused by: java.io.IOException: java.util.zip.DataFormatException: invalid distance too far back
>       at org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:82)
>       at org.apache.pdfbox.cos.COSInputStream.create(COSInputStream.java:69)
>       at org.apache.pdfbox.cos.COSStream.createInputStream(COSStream.java:162)
>       at org.apache.pdfbox.pdfparser.PDFObjectStreamParser.<init>(PDFObjectStreamParser.java:55)
>       at org.apache.pdfbox.pdfparser.COSParser.parseObjectStream(COSParser.java:847)
>       at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:753)
>       at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:678)
>       at org.apache.pdfbox.pdfparser.COSParser.parseDictObjects(COSParser.java:638)
>       at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:236)
>       at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:271)
>       at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:984)
>       at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:940)
>       at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:888)
>       at com.diligen.parser.pdf.PdfBoxHelper.getDocumentWithLineSegments(PdfBoxHelper.java:131)
>       ... 7 more
> Caused by: java.util.zip.DataFormatException: invalid distance too far back
>       at java.util.zip.Inflater.inflateBytes(Native Method)
>       at java.util.zip.Inflater.inflate(Inflater.java:259)
>       at java.util.zip.Inflater.inflate(Inflater.java:280)
>       at org.apache.pdfbox.filter.FlateFilter.decompress(FlateFilter.java:107)
>       at org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:73)
>       ... 20 more
> {code}
> If there is no quick solution for this bug, is there a workaround? Can I 
> somehow catch the exception and take some action?
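
On the "catch the exception and take some action" part: the stack trace above shows the DataFormatException arriving wrapped in an IOException, so one possible approach is to catch the IOException around the load call and inspect its cause chain. The sketch below uses only the JDK; the class CorruptStreamCheck and the method causedByBadDeflateData are hypothetical names for this example, not PDFBox API, and the exception in main is a stand-in built by hand rather than one produced by parsing the attached file.

```java
import java.io.IOException;
import java.util.zip.DataFormatException;

/** Hypothetical helper for illustration; not part of PDFBox. */
final class CorruptStreamCheck {

    /** Reports whether a DataFormatException appears anywhere in the cause chain. */
    static boolean causedByBadDeflateData(Throwable t) {
        // Bound the walk so an unusual cyclic cause chain cannot loop forever.
        for (int depth = 0; t != null && depth < 32; depth++, t = t.getCause()) {
            if (t instanceof DataFormatException) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Stand-in for the reported failure: an IOException wrapping a
        // DataFormatException, mirroring the shape of the stack trace.
        IOException reported = new IOException(
                new DataFormatException("invalid distance too far back"));
        IOException unrelated = new IOException("file not found");

        System.out.println(causedByBadDeflateData(reported));  // true
        System.out.println(causedByBadDeflateData(unrelated)); // false
    }
}
```

In application code, the check would sit in the catch block around PDDocument.load(...), where a true result could route the file to a skip list or a fallback tool. This only detects the failure; it does not recover the document.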


