[ https://issues.apache.org/jira/browse/PDFBOX-796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Adam Nichols updated PDFBOX-796:
--------------------------------
Issue Type: Improvement (was: Bug)
Due Date: 27/Aug/10 (was: 20/Aug/10)
> Objects from streams overwrite objects already read with the same
> ID/Generation
> -------------------------------------------------------------------------------
>
> Key: PDFBOX-796
> URL: https://issues.apache.org/jira/browse/PDFBOX-796
> Project: PDFBox
> Issue Type: Improvement
> Components: Parsing
> Environment: 32-bit Windows Vista, Java 1.5, PDFBox head tag
> Reporter: Adam Nichols
> Assignee: Adam Nichols
> Fix For: 1.3.0
>
> Attachments: PDFBOX-796.patch
>
>
> When merging some documents with the PDFMergerUtility class, I got a
> NullPointerException and the merge failed. Tracing through the code, I
> found that some objects were being overwritten when PDFParser called
> document.dereferenceObjectStreams() (line 207 of PDFParser.java).
> Having multiple objects with the same object ID violates the PDF
> specification, so the correct way to handle them is undefined. The "use the
> first object" approach allowed my file to be processed, and it is consistent
> with the other code in PDFBox. For another example of how PDFBox handles
> reading an object which already exists, see PDFParser (line 541), which
> checks whether the object has already been read and put in the pool. If it
> has, the newly read object is added to the list of conflicts. Later, when
> resolveConflicts() is called, it overwrites the pooled object only if the
> conflicting object is specifically referenced in the xref table. This is a
> reasonable way to resolve conflicts because an object which isn't in the
> xref table is likely the wrong one.
> Objects read from a compressed object stream have no byte offset in the
> file, so we cannot add them to the conflict list and check them against the
> xref table to decide which object is legitimate. It's best to keep the data
> we've already read, as using the object from the stream has been confirmed
> to cause problems. I've done regression testing with other files which have
> this problem, including the file from PDFBOX-720, and have not seen any
> issues.
> Unfortunately, I cannot provide the PDF which demonstrates this problem and
> solution, as it contains information I'm not authorized to release.
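
To illustrate the "use the first object" approach described above, here is a minimal Java sketch. It is not the attached patch and does not use any PDFBox classes: ObjectKey, ObjectPool, addParsed and addFromObjectStream are hypothetical names, and plain strings stand in for parsed PDF objects. It also leaves out the conflict list and xref check that the parser applies to objects read directly from the file.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

public class FirstObjectWins {

    /** Hypothetical key: PDF object number plus generation number. */
    static final class ObjectKey {
        final long number;
        final int generation;

        ObjectKey(long number, int generation) {
            this.number = number;
            this.generation = generation;
        }

        @Override
        public boolean equals(Object other) {
            if (!(other instanceof ObjectKey)) {
                return false;
            }
            ObjectKey that = (ObjectKey) other;
            return number == that.number && generation == that.generation;
        }

        @Override
        public int hashCode() {
            return Objects.hash(number, generation);
        }
    }

    /** Hypothetical object pool; the String values stand in for parsed objects. */
    static final class ObjectPool {
        private final Map<ObjectKey, String> objects = new HashMap<>();

        /** Stores an object read directly from the file body (no conflict handling here). */
        void addParsed(ObjectKey key, String value) {
            objects.put(key, value);
        }

        /**
         * Stores an object unpacked from a compressed object stream, but only
         * if the pool has no object with the same number and generation.
         * Returns true if the object was added, false if it was skipped.
         */
        boolean addFromObjectStream(ObjectKey key, String value) {
            return objects.putIfAbsent(key, value) == null;
        }

        String get(ObjectKey key) {
            return objects.get(key);
        }
    }

    public static void main(String[] args) {
        ObjectPool pool = new ObjectPool();
        ObjectKey key = new ObjectKey(12, 0);

        pool.addParsed(key, "object from file body");
        // Same number and generation: the stream copy is skipped.
        boolean added = pool.addFromObjectStream(new ObjectKey(12, 0), "object from stream");

        System.out.println("stream object added: " + added);
        System.out.println("pool holds: " + pool.get(key));
    }
}
```

Using putIfAbsent keeps the check and the insert in one call, so an object unpacked from a stream can only fill an empty slot and never replaces an object that was already read.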