[
https://issues.apache.org/jira/browse/PDFBOX-2079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13999265#comment-13999265
]
Tilman Hausherr commented on PDFBOX-2079:
-----------------------------------------
The bug has been present since rev 1585781, which fixed PDFBOX-2016, another
bug with lengths. I suspect that because of the sequential parsing, the correct
length wasn't available when reading the PDF, so we were reading up to "endstream"
(although the length is available further down in the file!). The length read
that way was wrong because of what you mentioned at the beginning.
I will need to find out why the sequential parser reads CR LF, whether this is
correct or not, and whether it can be changed.
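To illustrate where the two extra bytes come from, here is a minimal sketch. It
is not PDFBox code: the class and method names are hypothetical, and it assumes
the whole object is already in a byte array. The PDF specification says there
should be an end-of-line marker between the stream data and "endstream", and
that this marker is not counted in the stream length. A reader that falls back
to scanning for "endstream" therefore has to drop one trailing EOL itself.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration only; not the PDFBox parser.
public class EndstreamTrim {
    private static final byte[] ENDSTREAM =
            "endstream".getBytes(StandardCharsets.US_ASCII);

    /** Index of the first "endstream" keyword at or after 'from', or -1. */
    static int indexOfEndstream(byte[] buf, int from) {
        outer:
        for (int i = Math.max(from, 0); i + ENDSTREAM.length <= buf.length; i++) {
            for (int j = 0; j < ENDSTREAM.length; j++) {
                if (buf[i + j] != ENDSTREAM[j]) {
                    continue outer;
                }
            }
            return i;
        }
        return -1;
    }

    /** Bytes from dataStart up to "endstream", minus one trailing EOL marker. */
    static byte[] readStreamData(byte[] buf, int dataStart) {
        int end = indexOfEndstream(buf, dataStart);
        if (end < 0) {
            throw new IllegalArgumentException("no endstream keyword found");
        }
        if (end > dataStart && buf[end - 1] == '\n') {
            end--;                                   // LF, or the LF of CR LF
            if (end > dataStart && buf[end - 1] == '\r') {
                end--;                               // the CR of CR LF
            }
        } else if (end > dataStart && buf[end - 1] == '\r') {
            end--;                                   // lone CR
        }
        return Arrays.copyOfRange(buf, dataStart, end);
    }

    public static void main(String[] args) {
        byte[] raw = "PK-data\r\nendstream".getBytes(StandardCharsets.US_ASCII);
        System.out.println(readStreamData(raw, 0).length); // prints 7
    }
}
```

This is only a fallback heuristic: binary data can contain the bytes
"endstream" by chance, and a file that omits the EOL marker would lose real
data, so a correct /Length should be preferred whenever it is available.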
Anyway, it shows once again that you shouldn't use load(). There's a
useNonSequentialParser config option in Tika.
> Extra new line characters extracted in 1.8.5 for embedded files leading to
> ZipFile exception in Java 1.6
> --------------------------------------------------------------------------------------------------------
>
> Key: PDFBOX-2079
> URL: https://issues.apache.org/jira/browse/PDFBOX-2079
> Project: PDFBox
> Issue Type: Bug
> Components: PDModel
> Affects Versions: 1.8.5, 1.8.6, 2.0.0
> Reporter: Tim Allison
> Assignee: Tilman Hausherr
> Priority: Minor
> Attachments: PDFBOX-2079-TEST_CASE.patch, embedded_zip.pdf
>
>
> For the test file I'll attach shortly, PDFBox 1.8.4 extracts 17660 bytes from
> an embedded zip (well, docx) file. PDFBox 1.8.5 extracts 17662 bytes --
> "\r\n" at the end of the stream. This leads to a ZipException for ZipFile(s)
> in Java 1.6, but not Java 1.7.