[ 
https://issues.apache.org/jira/browse/PDFBOX-2079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13999265#comment-13999265
 ] 

Tilman Hausherr commented on PDFBOX-2079:
-----------------------------------------

The bug has been present since rev 1585781, which fixed PDFBOX-2016, another 
bug involving lengths. I suspect that because of the sequential parsing, the 
correct length wasn't available when reading the PDF, so we were reading up to 
"endstream" instead (although the length is available further down in the 
file!). The length obtained that way was wrong because of what you mentioned 
at the beginning. 

I will need to find out why the sequential parser reads CR LF, whether this is 
correct or not, and whether it can be changed.
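
To illustrate the suspected mechanism, here is a minimal standalone sketch. It is not PDFBox code, and the class and method names (StreamLengthSketch, readByLength, readUntilEndstream, wrap) are made up for this example. It assumes a stream whose data is followed by CR LF and then the "endstream" keyword, and compares reading by a declared length with reading up to the keyword.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration only; none of these names exist in PDFBox.
public class StreamLengthSketch {
    static final byte[] HEADER = "stream\r\n".getBytes(StandardCharsets.US_ASCII);
    static final byte[] TRAILER = "\r\nendstream".getBytes(StandardCharsets.US_ASCII);
    static final byte[] KEYWORD = "endstream".getBytes(StandardCharsets.US_ASCII);
    static final int DATA_START = HEADER.length;

    // Builds "stream CR LF <payload> CR LF endstream" as raw bytes.
    static byte[] wrap(byte[] payload) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(HEADER, 0, HEADER.length);
        out.write(payload, 0, payload.length);
        out.write(TRAILER, 0, TRAILER.length);
        return out.toByteArray();
    }

    // Reads exactly declaredLength bytes, as a parser that knows the length would.
    static byte[] readByLength(byte[] buf, int dataStart, int declaredLength) {
        return Arrays.copyOfRange(buf, dataStart, dataStart + declaredLength);
    }

    // Reads everything before the "endstream" keyword, without trimming the
    // end-of-line marker that precedes it.
    static byte[] readUntilEndstream(byte[] buf, int dataStart) {
        int end = indexOf(buf, KEYWORD, dataStart);
        if (end < 0) {
            throw new IllegalArgumentException("no endstream keyword found");
        }
        return Arrays.copyOfRange(buf, dataStart, end);
    }

    // Naive byte-pattern search starting at index 'from'.
    static int indexOf(byte[] haystack, byte[] needle, int from) {
        outer:
        for (int i = from; i <= haystack.length - needle.length; i++) {
            for (int j = 0; j < needle.length; j++) {
                if (haystack[i + j] != needle[j]) {
                    continue outer;
                }
            }
            return i;
        }
        return -1;
    }

    public static void main(String[] args) {
        byte[] payload = {'P', 'K', 3, 4}; // first bytes of a zip local file header
        byte[] buf = wrap(payload);
        System.out.println("by length: " + readByLength(buf, DATA_START, payload.length).length);
        System.out.println("until endstream: " + readUntilEndstream(buf, DATA_START).length);
    }
}
```

In this sketch the keyword scan returns two bytes more than the length-based read, which is the same difference as the 17660 vs. 17662 bytes in the report. Whether the sequential parser behaves exactly like this is still to be confirmed.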

Anyway, it shows once again that you shouldn't use load(). There's a 
useNonSequentialParser config option in TIKA.

> Extra new line characters extracted in 1.8.5 for embedded files leading to 
> ZipFile exception in Java 1.6
> --------------------------------------------------------------------------------------------------------
>
>                 Key: PDFBOX-2079
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2079
>             Project: PDFBox
>          Issue Type: Bug
>          Components: PDModel
>    Affects Versions: 1.8.5, 1.8.6, 2.0.0
>            Reporter: Tim Allison
>            Assignee: Tilman Hausherr
>            Priority: Minor
>         Attachments: PDFBOX-2079-TEST_CASE.patch, embedded_zip.pdf
>
>
> For the test file I'll attach shortly, PDFBox 1.8.4 extracts 17660 bytes from 
> an embedded zip (well, docx) file.  PDFBox 1.8.5 extracts 17662 bytes -- 
> "\r\n" at the end of the stream.  This leads to a ZipException for ZipFile(s) 
> in Java 1.6, but not Java 1.7. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)